PRAISE FOR

Why Machines Learn

“Some books about the development of neural networks describe the underlying mathematics while others describe the social history. This book presents the mathematics in the context of the social history. It is a masterpiece. The author is very good at explaining the mathematics in a way that makes it available to people with only a rudimentary knowledge of the field, but he is also a very good writer who brings the social history to life.”

—GEOFFREY HINTON, deep learning pioneer, Turing Award winner, former VP at Google, and professor emeritus at the University of Toronto

“After just a few minutes of reading Why Machines Learn, you’ll feel your own synaptic weights getting updated. By the end you will have achieved your own version of deep learning—with deep pleasure and insight along the way.”

—STEVEN STROGATZ, New York Times bestselling author of Infinite Powers and professor of mathematics at Cornell University

“If you were looking for a way to make sense of the AI revolution that is well under way, look no further. With this comprehensive yet engaging book, Anil Ananthaswamy puts it all into context, from the origin of the idea and its governing equations to its potential to transform medicine, quantum physics—and virtually every aspect of our life. An essential read for understanding both the possibilities and limitations of artificial intelligence.”

—SABINE HOSSENFELDER, physicist and New York Times bestselling author of Existential Physics: A Scientist’s Guide to Life’s Biggest Questions

“Why Machines Learn is a masterful work that explains—in clear, accessible, and entertaining fashion—the mathematics underlying modern machine learning, along with the colorful history of the field and its pioneering researchers. As AI has increasingly profound impacts in our world, this book will be an invaluable companion for anyone who wants a deep understanding of what’s under the hood of these often inscrutable machines.”

—MELANIE MITCHELL, author of Artificial Intelligence and professor at the Santa Fe Institute

“Generative AI, with its foundations in machine learning, is as fundamental an advance as the creation of the microprocessor, the internet, and the mobile phone. But almost no one, outside of a handful of specialists, understands how it works. Anil Ananthaswamy has removed the mystery by giving us a gentle, intuitive, and human-oriented introduction to the math that underpins this revolutionary development.”

—PETER E. HART, AI pioneer, entrepreneur, and co-author of Pattern Classification

“Anil Ananthaswamy’s Why Machines Learn embarks on an exhilarating journey through the origins of contemporary machine learning. With a captivating narrative, the book delves into the lives of influential figures driving the AI revolution while simultaneously exploring the intricate mathematical formalism that underpins it. As Anil traces the roots and unravels the mysteries of modern AI, he gently introduces the underlying mathematics, rendering the complex subject matter accessible and exciting for readers of all backgrounds.”

—BJÖRN OMMER, professor at the Ludwig Maximilian University of Munich and leader of the original team behind Stable Diffusion

ALSO BY ANIL ANANTHASWAMY

Through Two Doors at Once
The Man Who Wasn’t There
The Edge of Physics
Data Communications Using Object-Oriented Design and C++

An imprint of Penguin Random House LLC

penguinrandomhouse.com

Copyright © 2024 by Anil Ananthaswamy

Penguin Random House supports copyright. Copyright fuels creativity, encourages diverse voices, promotes free speech, and creates a vibrant culture. Thank you for buying an authorized edition of this book and for complying with copyright laws by not reproducing, scanning, or distributing any part of it in any form without permission. You are supporting writers and allowing Penguin Random House to continue to publish books for every reader.

DUTTON and the D colophon are registered trademarks of Penguin Random House LLC.

Portions of chapter 12 and the epilogue appeared in Quanta Magazine. The illustration in chapter 6 on PCA done on EEG data adapted with permission from John Abel. The illustrations in chapter 12 on the bias-variance and double descent curves adapted with permission from Mikhail Belkin. Illustrations about properties of penguins in chapter 4 created courtesy of data made freely available by Kristen Gorman, Allison Horst, and Alison Hill. The illustrations of biological neuron (this page), paddy fields (this page), and the map of Manhattan (this page) by Roshan Shakeel.

LIBRARY OF CONGRESS CATALOGING-IN-PUBLICATION DATA

Names: Ananthaswamy, Anil, author.

Title: Why machines learn : the elegant math behind modern AI / Anil Ananthaswamy.

Description: New York : Dutton, [2024] | Includes bibliographical references and index.

Identifiers: LCCN 2024000738 | ISBN 9780593185742 (hardcover) | ISBN 9780593185759 (ebook)

Subjects: LCSH: Machine learning. | Deep learning (Machine learning) | Artificial intelligence. | Mathematics.

Classification: LCC Q325.5 .A56 2024 | DDC 006.3/1—dc23/eng/20240326

LC record available at https://lccn.loc.gov/2024000738

Ebook ISBN 9780593185759

Cover design by Dominique Jones

Illustration by Jason Booher “after M.C. Escher”

BOOK DESIGN BY ASHLEY TUCKER, ADAPTED FOR EBOOK BY MOLLY JESZKE

While the author has made every effort to provide accurate telephone numbers, internet addresses, and other contact information at the time of publication, neither the publisher nor the author assumes any responsibility for errors or for changes that occur after publication. Further, the publisher does not have any control over and does not assume any responsibility for author or third-party websites or their content.

to teachers everywhere, sung and unsung
Whatever we do, we have to make our life vectors. Lines with force and direction.
—LIAM NEESON AS FBI AGENT MARK FELT IN THE 2017 MOVIE OF THE SAME NAME
The author acknowledges with gratitude the support of the Alfred P. Sloan Foundation in the research and writing of this book.

Prologue

Buried on this page of the July 8, 1958, issue of The New York Times was a rather extraordinary story. The headline read, “New Navy Device Learns by Doing: Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser.” The opening paragraph raised the stakes: “The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

With hindsight, the hyperbole is obvious and embarrassing. But The New York Times wasn’t entirely at fault. Some of the over-the-top talk also came from Frank Rosenblatt, a Cornell University psychologist and project engineer. Rosenblatt, with funding from the U.S. Office of Naval Research, had invented the perceptron, a version of which was presented at a press conference the day before the New York Times story about it appeared in print. According to Rosenblatt, the perceptron would be the “first device to think as the human brain” and such machines might even be sent to other planets as “mechanical space explorers.”

None of this happened. The perceptron never lived up to the hype. Nonetheless, Rosenblatt’s work was seminal. Almost every lecturer on artificial intelligence (AI) today will harken back to the perceptron. And that’s justified. This moment in history—the arrival of large language models (LLMs) such as ChatGPT and its ilk and our response to it—which some have likened to what it must have felt like in the 1910s and ’20s, when physicists were confronted with the craziness of quantum mechanics, has its roots in research initiated by Rosenblatt. There’s a line in the New York Times story that only hints at the revolution the perceptron set in motion: “Dr. Rosenblatt said he could explain why the machine learned only in highly technical terms” (italics mine). The story, however, had none of the “highly technical” details.

This book does. It tackles the technical details. It explains the elegant mathematics and algorithms that have, for decades, energized and excited researchers in “machine learning,” a type of AI that involves building machines that can learn to discern patterns in data without being explicitly programmed to do so. Trained machines can then detect similar patterns in new, previously unseen data, making possible applications that range from recognizing pictures of cats and dogs to creating, potentially, autonomous cars and other technology. Machines can learn because of the extraordinary confluence of math and computer science, with more than a dash of physics and neuroscience added to the mix.

Machine learning (ML) is a vast field populated by algorithms that leverage relatively simple math that goes back centuries, math one learns in high school or early in college. There’s, of course, elementary algebra. Another extremely important cornerstone of machine learning is calculus, co-invented by no less a polymath than Isaac Newton. The field also relies heavily on the work of Thomas Bayes, the eighteenth-century English statistician and minister who gave us the eponymous Bayes’s theorem, a key contribution to the field of probability and statistics. The work of German mathematician Carl Friedrich Gauss on the Gaussian distribution (and the bell-shaped curve) also permeates machine learning. Then there’s linear algebra, which forms the backbone of machine learning. The earliest exposition of this branch of mathematics appears in a two-thousand-year-old Chinese text, Nine Chapters on the Mathematical Art. The modern version of linear algebra has its roots in the work of many mathematicians, but mainly Gauss, Gottfried Wilhelm Leibniz, Wilhelm Jordan, Gabriel Cramer, Hermann Günther Grassmann, James Joseph Sylvester, and Arthur Cayley.

By the mid-1850s, some of the basic math that would prove necessary to building learning machines was in place, even as other mathematicians continued developing more relevant mathematics and birthed and advanced the field of computer science. Yet, few could have dreamed that such early mathematical work would be the basis for the astounding developments in AI over the past half century, particularly over the last decade, some of which may legitimately allow us to envision a semblance of the kind of future Rosenblatt was overoptimistically foreshadowing in the 1950s.

This book tells the story of this journey, from Rosenblatt’s perceptron to modern-day deep neural networks, elaborate networks of computational units called artificial neurons, through the lens of key mathematical ideas underpinning the field of machine learning. It eases gently into the math and then, ever so slowly, ratchets up the difficulty, as we go from the relatively simple ideas of the 1950s to the somewhat more involved math and algorithms that power today’s machine learning systems.

Hence, we will unabashedly embrace equations and concepts from at least four major fields of mathematics—linear algebra, calculus, probability and statistics, and optimization theory—to acquire the minimum theoretical and conceptual knowledge necessary to appreciate the awesome power we are bestowing on machines. It is only when we understand the inevitability of learning machines that we will be prepared to tackle a future in which AI is ubiquitous, for good and for bad.

Getting under the mathematical skin of machine learning is crucial to our understanding of not just the power of the technology, but also its limitations. Machine learning systems are already making life-altering decisions for us: approving credit card applications and mortgage loans, determining whether a tumor is cancerous, predicting the prognosis for someone in cognitive decline (will they go on to get Alzheimer’s?), and deciding whether to grant someone bail. Machine learning has permeated science, too: It is influencing chemistry, biology, physics, and everything in between. It’s being used in the study of genomes, extrasolar planets, the intricacies of quantum systems, and much more. And as of this writing, the world of AI is abuzz with the advent of large language models such as ChatGPT. The ball has only just gotten rolling.

We cannot leave decisions about how AI will be built and deployed solely to its practitioners. If we are to effectively regulate this extremely useful, but disruptive and potentially threatening, technology, another layer of society—educators, politicians, policymakers, science communicators, or even interested consumers of AI—must come to grips with the basics of the mathematics of machine learning.

In her book Is Math Real?, mathematician Eugenia Cheng writes about the gradual process of learning mathematics: “It can…seem like we’re taking very small steps and not getting anywhere, before suddenly we look behind us and discover we’ve climbed a giant mountain. All these things can be disconcerting, but accepting a little intellectual discomfort (or sometimes a lot of it) is an important part of making progress in math.”

Fortunately, the “intellectual discomfort” in store for us is eminently endurable and more than assuaged by the intellectual payoff, because underlying modern ML is some relatively simple and elegant math—a notion that’s best illustrated with an anecdote about Ilya Sutskever. Today, Sutskever is best known as the co-founder of OpenAI, the company behind ChatGPT. More than a decade ago, as a young undergraduate student looking for an academic advisor at the University of Toronto, Sutskever knocked on Geoffrey Hinton’s door. Hinton was already a well-known name in the field of “deep learning,” a form of machine learning, and Sutskever wanted to work with him. Hinton gave Sutskever some papers to read, which he devoured. He remembers being perplexed by the simplicity of the math, compared to the math and physics of his regular undergrad coursework. He could read these papers on deep learning and understand powerful concepts. “How can it be that it’s so simple…so simple that you can explain it to high school students without too much effort?” he told me. “I think that’s actually miraculous. This is also, to me, an indication that we are probably on the right track. [It can’t] be a coincidence that such simple concepts go so far.”

Of course, Sutskever already had sophisticated mathematical chops, so what seemed simple to him may not be so for most of us, including me. But let’s see.

This book aims to communicate the conceptual simplicity underlying ML and deep learning. This is not to say that everything we are witnessing in AI now—in particular, the behavior of deep neural networks and large language models—is amenable to being analyzed using simple math. In fact, the denouement of this book leads us to a place that some might find disconcerting, though others will find it exhilarating: These networks and AIs seem to flout some of the fundamental ideas that have, for decades, underpinned machine learning. It’s as if empirical evidence has broken the theoretical camel’s back in the same way experimental observations of the material world in the early twentieth century broke classical physics; we need something new to make sense of the brave new world awaiting us.

As I did the research for this book, I observed a pattern to my learning that reminded me of the way modern artificial neural networks learn: With each pass the algorithm makes through data, it learns more about the patterns that exist in that data. One pass may not be enough; nor ten; nor a hundred. Sometimes, neural networks learn over tens of thousands of iterations through the data. This is indeed the way I grokked the subject in order to write about it. Each pass through some corner of this vast base of knowledge caused some neurons in my brain to make connections, literally and metaphorically. Things that didn’t make sense the first or second time around eventually did upon later passes.

I have used this technique to help readers make similar connections: I found myself repeating ideas and concepts over the course of writing this book, sometimes using the same phrasing or, at times, a different take on the same concept. These repetitions and rephrasings are intentional: They are one way that most of us who are not mathematicians or practitioners of ML can come to grips with a paradoxically simple yet complex subject. Once an idea is exposed, our brains might see patterns and make connections when encountering that idea elsewhere, making more sense of it than would have been possible at first blush.

I hope your neurons enjoy this process as much as mine did.

CHAPTER 1

Desperately Seeking Patterns

When he was a child, the Austrian scientist Konrad Lorenz, enamored by tales from a book called The Wonderful Adventures of Nils—the story of a boy’s adventures with wild geese written by the Swedish novelist and winner of the Nobel Prize for Literature, Selma Lagerlöf—“yearned to become a wild goose.” Unable to indulge his fantasy, the young Lorenz settled for taking care of a day-old duckling his neighbor gave him. To the boy’s delight, the duckling began following him around: It had imprinted on him. “Imprinting” refers to the ability of many animals, including baby ducks and geese (goslings), to form bonds with the first moving thing they see upon hatching. Lorenz would go on to become an ethologist and would pioneer studies in the field of animal behavior, particularly imprinting. (He got ducklings to imprint on him; they followed him around as he walked, ran, swam, and even paddled away in a canoe.) He won the Nobel Prize for Physiology or Medicine in 1973, jointly with fellow ethologists Karl von Frisch and Nikolaas Tinbergen. The three were celebrated “for their discoveries concerning organization and elicitation of individual and social behavior patterns.”

Patterns. While the ethologists were discerning them in the behavior of animals, the animals were detecting patterns of their own. Newly hatched ducklings must have the ability to make out or tell apart the properties of things they see moving around them. It turns out that ducklings can imprint not just on the first living creature they see moving, but on inanimate things as well. Mallard ducklings, for example, can imprint on a pair of moving objects that are similar in shape or color. Specifically, they imprint on the relational concept embodied by the objects. So, if upon birth the ducklings see two moving red objects, they will later follow two objects of the same color (even if those latter objects are blue, not red), but not two objects of different colors. In this case, the ducklings imprint on the idea of similarity. They also show the ability to discern dissimilarity. If the first moving objects the ducklings see are, for example, a cube and a rectangular prism, they will recognize that the objects have different shapes and will later follow two objects that are different in shape (a pyramid and a cone, for example), but they will ignore two objects that have the same shape.

Ponder this for a moment. Newborn ducklings, with the briefest of exposure to sensory stimuli, detect patterns in what they see, form abstract notions of similarity/dissimilarity, and then will recognize those abstractions in stimuli they see later and act upon them. Artificial intelligence researchers would offer an arm and a leg to know just how the ducklings pull this off.

While today’s AI is far from being able to perform such tasks with the ease and efficiency of ducklings, it does have something in common with the ducklings, and that’s the ability to pick out and learn about patterns in data. When Frank Rosenblatt invented the perceptron in the late 1950s, one reason it made such a splash was because it was the first formidable “brain-inspired” algorithm that could learn about patterns in data simply by examining the data. Most important, given certain assumptions about the data, researchers proved that Rosenblatt’s perceptron will always find the pattern hidden in the data in a finite amount of time; or, put differently, the perceptron will converge upon a solution without fail. Such certainties in computing are like gold dust. No wonder the perceptron learning algorithm created such a fuss.

But what do these terms mean? What are “patterns” in data? What does “learning about these patterns” imply? Let’s start by examining this table:

Each row in the table is a triplet of values for variables x1, x2, and y. There’s a simple pattern hidden in this data: In each row, the value of y is related to the corresponding values of x1 and x2. See if you can spot it before reading further.

In this case, with a pencil, paper, and a little effort one can figure out that y equals x1 plus two times x2.

y = x1 + 2x2

A small point about notation: We are going to dispense with the multiplication sign (“×”) between two variables or between a constant and a variable. For example, we’ll write

2 × x2 as 2x2 and x1 × x2 as x1x2

Ideally, we should write 2x2 as 2x₂ and x1x2 as x₁x₂, with the variables subscripted. But we’ll dispense with the subscripts, too, unless it becomes absolutely necessary to use them. (Purists will cringe, but this method helps keep our text less cluttered and easy on the eye; when we do encounter subscripts, read xᵢ as “x sub-i.”) So, keep this in mind: If there’s a symbol such as “x” followed by a digit such as “2,” giving us x2, take the entire symbol to mean one thing. If a symbol (say, x or x2) is preceded by a number (say, 9), or by another symbol (say, w1), then the number and the symbol, or the two symbols, are being multiplied. So:

2x2 = 2 × x2

x1x2 = x1 × x2

w2x1 = w2 × x1

Getting back to our equation y = x1 + 2x2, more generally, we can write this as:

y = w1x1 + w2x2, where w1 = 1 and w2 = 2

To be clear, we have found one of the many possible relationships between y and x1 and x2. There can be others. And indeed, for this example, there are, but we don’t need to worry about them for our purposes here. Finding patterns is nowhere near as simple as this example is suggesting, but it gets us going.

We identified what’s called a linear relationship between y, on the one hand, and x1 and x2, on the other. (“Linear” means that y depends only on x1 and x2, and not on x1 or x2 raised to some power, or on any product of x1 and x2.) Also, I’m using the words “equation” and “relationship” interchangeably here.

The relationship between y, x1, and x2 is defined by the constants w1 and w2. These constants are called the coefficients, or weights, of the linear equation connecting y to x1 and x2. In this simple case, assuming such a linear relationship exists, we figured out the values for w1 and w2 after inspecting the data. But often, the relationship between y and (x1, x2,…) is not so straightforward, especially when it extends to more values on the right side of the equation.

For example, consider:

y = w1x1 + w2x2 + w3x3 + ··· + w9x9

Or, more generally, for a set of n weights, and using formal mathematical notation:

y = w1x1 + w2x2 + ··· + wnxn = Σ wixi, with i running from 1 to n

The expression on the right, using the sigma notation, is shorthand for summing all wixi, where i takes on values from 1 to n.
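The sigma expression is just a weighted sum, what linear algebra calls a dot product. As a minimal sketch (in Python; the language choice is mine, the book presents only the math):

```python
# Computes y = w1x1 + w2x2 + ... + wnxn, i.e. the sum of wixi
# for i running from 1 to n, for any number of inputs n.
def weighted_sum(weights, inputs):
    assert len(weights) == len(inputs)
    return sum(w * x for w, x in zip(weights, inputs))

# With the toy example's weights w1 = 1, w2 = 2:
print(weighted_sum([1, 2], [4, 2]))  # 1*4 + 2*2 = 8
print(weighted_sum([1, 2], [1, 2]))  # 1*1 + 2*2 = 5
```

The two calls reproduce the labeled pairs (4, 2) → 8 and (1, 2) → 5 from the toy dataset.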

In the case of 9 inputs, you’d be hard-pressed to extract the values of w1 to w9 just by visually inspecting the data and doing some mental arithmetic. That’s where learning comes in. If there’s a way to algorithmically figure out the weights, then the algorithm is “learning” the weights. But what’s the point of doing that?

Well, once you have learned the weights—say, w1 and w2 in our simple, toy example—then given some value of x1 and x2 that wasn’t in our initial dataset, we can calculate the value of y. Say, x1 = 5 and x2 = 2. Plug these values into the equation y = x1 + 2x2 and you get a value of y = 9.
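Once the weights are known, prediction is nothing but arithmetic. A tiny sketch of that step (in Python, my choice of notation, not the book's):

```python
# Weights learned from the toy example: y = w1*x1 + w2*x2
w1, w2 = 1, 2

# A new input that wasn't in the original dataset:
x1, x2 = 5, 2

y = w1 * x1 + w2 * x2
print(y)  # 5 + 2*2 = 9
```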

What’s all this got to do with real life? Take a very simple, practical, and some would say utterly boring problem. Let’s say x1 represents the number of bedrooms in a house, and x2 represents the total square footage, and y represents the price of the house. Let’s assume that there exists a linear relationship between (x1, x2) and y. Then, by learning the weights of the linear equation from some existing data about houses and their prices, we have essentially built a very simple model with which to predict the price of a house, given the number of bedrooms and the square footage.
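
One standard way to "learn" such weights algorithmically is gradient descent, a technique sketched here rather than anything the text has introduced yet. The first two labeled pairs come from the text's table, (4, 2) → 8 and (1, 2) → 5; the other two pairs and the learning rate are invented for illustration, consistent with y = x1 + 2x2:

```python
# A minimal sketch of learning w1 and w2 from labeled data via gradient descent.
data = [((4, 2), 8), ((1, 2), 5), ((3, 3), 9), ((2, 1), 4)]  # ((x1, x2), y)

w1, w2 = 0.0, 0.0
lr = 0.01  # learning rate: an arbitrary small step size
for _ in range(5000):
    for (x1, x2), y in data:
        err = (w1 * x1 + w2 * x2) - y  # prediction error on this sample
        w1 -= lr * err * x1            # nudge each weight against the error
        w2 -= lr * err * x2

print(round(w1, 2), round(w2, 2))  # converges near w1 = 1, w2 = 2
```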

The above example—a teeny, tiny baby step, really—is the beginning of machine learning. What we just did is a simplistic form of something called supervised learning. We were given samples of data that had hidden in them some correlation between a set of inputs and a set of outputs. Such data are said to be annotated, or labeled; they are also called the training data. Each input (x1, x2,…, xn) has a label y attached to it. So, in our earlier numerical table, the pair of numbers (4, 2) is labeled with y = 8, the pair (1, 2) with 5, and so on. We figured out the correlation. Once it is learned, we can use it to make predictions about new inputs that weren’t part of the training data.

Also, we did a very particular kind of problem solving called regression, where given some independent variables (x1, x2), we built a model (or equation) to predict the value of a dependent variable (y). There are many other types of models we could have built, and we’ll come to them in due course.

In this case, the correlation, or pattern, was so simple that we needed only a small amount of labeled data. But modern ML requires orders of magnitude more—and the availability of such data has been one of the factors fueling the AI revolution. (The ducklings, for their part, likely indulge in a more sophisticated form of learning. No parent duck sits around labeling the data for its ducklings, and yet the babies learn. How do they do it? Spoiler alert: We don’t know, but maybe by understanding why machines learn, we can one day fully understand how ducklings and, indeed, humans learn.)

It may seem implausible, but this first step we took using a laughably simple example of supervised learning sets us on a path toward understanding modern deep neural networks—one step at a time, of course (with small, gentle, and occasionally maybe not so gentle dollops of vectors, matrices, linear algebra, calculus, probability and statistics, and optimization theory served, as needed, along the way).

Rosenblatt’s perceptron, which we briefly encountered in the prologue, was for its time an astonishing example of one such learning algorithm. And because it was modeled on how neuroscientists thought human neurons worked, it came imbued with mystique and the promise that, one day, perceptrons would indeed make good on the promise of AI.

THE FIRST ARTIFICIAL NEURON

The perceptron’s roots lie in a 1943 paper by an unlikely combination of a philosophically minded neuroscientist in his mid-forties and a homeless teenager. Warren McCulloch was an American neurophysiologist trained in philosophy, psychology, and medicine. During the 1930s, he worked on neuroanatomy, creating maps of the connectivity of parts of monkey brains. While doing so, he also obsessed over the “logic of the brain.” By then, the work of mathematicians and philosophers like Alan Turing, Alfred North Whitehead, and Bertrand Russell was suggesting a deep connection between computation and logic. The statement “If P is true AND Q is true, then S is true” is an example of a logical proposition. The assertion was that all computation could be reduced to such logic. Given this way of thinking about computation, the question bothering McCulloch was this: If the brain is a computational device, as many think it is, how does it implement such logic?

With these questions in mind, McCulloch moved in 1941 from Yale University to the University of Illinois, where he met a prodigiously talented teenager named Walter Pitts. The youngster, already an accomplished logician (“a protégé of the eminent mathematical logician Rudolf Carnap”), was attending seminars run by Ukrainian mathematical physicist Nicolas Rashevsky in Chicago. Pitts, however, was a “mixed-up adolescent, essentially a runaway from a family that could not appreciate his genius.” McCulloch and his wife, Rook, gave Walter a home. “There followed endless evenings sitting around the McCulloch kitchen table trying to sort out how the brain worked, with the McCullochs’ daughter Taffy sketching little pictures,” wrote computer scientist Michael Arbib. Taffy’s drawings would later illustrate McCulloch and Pitts’s 1943 paper, “A Logical Calculus of the Ideas Immanent in Nervous Activity.”

In that work, McCulloch and Pitts proposed a simple model of a biological neuron. First, here’s an illustration of a generic biological neuron:

The neuron’s cell body receives inputs via its treelike projections, called dendrites. The cell body performs some computation on these inputs. Then, based on the results of that computation, it may send an electrical signal spiking along another, longer projection, called the axon. That signal travels along the axon and reaches its branching terminals, where it’s communicated to the dendrites of neighboring neurons. And so it goes. Neurons interconnected in this manner form a biological neural network.

McCulloch and Pitts turned this into a simple computational model, an artificial neuron. They showed how by using one such artificial neuron, or neurode (for “neuron” + “node”), one could implement certain basic Boolean logical operations such as AND, OR, NOT, and so on, which are the building blocks of digital computation. (For some Boolean operations, such as exclusive-OR, or XOR, you need more than one neurode, but more on this later.) What follows is an image of a single neurode. (Ignore the “g” and “f” inside the neuron for now; we’ll come to those in a moment.)

In this simple version of the McCulloch-Pitts model, x1 and x2 can be either 0 or 1. In formal notation, we can say:

x1, x2 ∈ {0,1}

That should be read as x1 is an element of the set {0, 1} and x2 is an element of the set {0, 1}; x1 and x2 can take on only values 0 or 1 and nothing else. The neurode’s output y is calculated by first summing the inputs and then checking to see if that sum is greater than or equal to some threshold, theta (θ). If so, y equals 1; if not, y equals 0.

sum = x1 + x2

If sum ≥ θ: y = 1

Else: y = 0

Generalizing this to an arbitrary sequence of inputs, x1, x2, x3,…, xn, one can write down the formal mathematical description of the simple neurode. First, we define the function g(x)—read that as “g of x,” where x here is the set of inputs (x1, x2, x3,…, xn)—which sums up the inputs. Then we define the function f(g(x))—again, read that as “f of g of x”—which takes the summation and performs the thresholding to generate the output, y: It is zero if g(x) is less than some θ and 1 if g(x) is greater than or equal to θ.

With one artificial neuron as described, we can design some of the basic Boolean logic gates (AND & OR, for example). In an AND logic gate, the output y should be 1 if both x1 and x2 are equal to 1; otherwise, the output should be 0. In this case, θ = 2 does the trick. Now, the output y will be 1 only when x1 and x2 are both 1 (only then will x1 + x2 be greater than or equal to 2). You can play with the value of θ to design the other logic gates. For example, in an OR gate, the output should be 1 if either x1 or x2 is 1; otherwise, the output should be 0. What should θ be?
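
The neurode's compute-and-threshold behavior is easy to sketch in a few lines of Python. The function name is mine; θ = 2 gives the AND gate described in the text:

```python
# A sketch of a McCulloch-Pitts neurode with binary inputs and a threshold theta.
def neurode(inputs, theta):
    g = sum(inputs)                 # g(x): sum the inputs
    return 1 if g >= theta else 0   # f(g(x)): threshold the sum at theta

# AND gate: theta = 2, as in the text.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, neurode([x1, x2], theta=2))
# Only the input pair (1, 1) produces an output of 1. Try other values
# of theta to build the OR gate the text asks about.
```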

The simple MCP model can be extended. You can increase the number of inputs. You can let inputs be “inhibitory,” meaning x1 or x2 can be multiplied by -1. If one of the inputs to the neurode is inhibitory and you set the threshold appropriately, then the neurode will always output a 0, regardless of the value of all the other inputs. This allows you to build more complex logic. As does interconnecting multiple neurodes such that the output of one neurode serves as the input to another.

All this was amazing, and yet limited. The McCulloch-Pitts (MCP) neuron is a unit of computation, and you can use combinations of it to create any type of Boolean logic. Given that all digital computation at its most basic is a sequence of such logical operations, you can essentially mix and match MCP neurons to carry out any computation. This was an extraordinary statement to make in 1943. The mathematical roots of McCulloch and Pitts’s paper were apparent. The paper had only three references—Carnap’s The Logical Syntax of Language; David Hilbert and Wilhelm Ackermann’s Foundations of Theoretical Logic; and Whitehead and Russell’s Principia Mathematica—and none of them had to do with biology. There was no doubting the rigorous results derived in the McCulloch-Pitts paper. And yet, the upshot was simply a machine that could compute, not learn. In particular, the value of θ had to be hand-engineered; the neuron couldn’t examine the data and figure out θ.

It’s no wonder Rosenblatt’s perceptron made such a splash. It could learn its weights from data. The weights encoded some knowledge, however minimal, about patterns in the data and remembered them, in a manner of speaking.

LEARNING FROM MISTAKES

Rosenblatt’s scholarship often left his students floored. George Nagy, who came to Cornell University in Ithaca, New York, in 1960 to do his Ph.D. with Rosenblatt, recalled a walk the two of them took, during which they talked about stereo vision. Rosenblatt blew Nagy away with his mastery of the topic. “It was difficult not to feel naïve talking to him in general,” said Nagy, now professor emeritus at Rensselaer Polytechnic Institute in Troy, New York; Rosenblatt’s evident erudition was accentuated by his relative youth. (He was barely ten years older than Nagy.)

Rosenblatt’s youthfulness almost got the two of them into trouble during a road trip. He and Nagy had to go from Ithaca to Chicago for a conference. Rosenblatt hadn’t yet written the paper he wanted to present, so he asked Nagy to drive while he worked. Nagy had never owned a car and barely knew how to drive, but he agreed nonetheless. “Unfortunately, I drove in several lanes at once, and a policeman stopped us,” Nagy said. Rosenblatt told the cop that he was a professor and had asked his student to drive. “The cop laughed and said, ‘You are not a professor, you are a student.’ ” Fortunately, Rosenblatt had enough papers on him to convince the cop of his credentials, and the cop let the two go. Rosenblatt drove the rest of the way to Chicago, where he stayed up all night typing his paper, which he presented the next day. “He was able to do these things,” Nagy told me.

By the time Nagy arrived at Cornell, Rosenblatt had already built the Mark I Perceptron; we saw in the prologue that Rosenblatt had done so in 1958, leading to the coverage in The New York Times. Nagy began working on the next machine, called Tobermory (named after the talking cat created by H. H. Munro, aka Saki), a hardware neural network designed for speech recognition. Meanwhile, the Mark I Perceptron and Rosenblatt’s ideas had already garnered plenty of attention.

In the summer of 1958, the editor of the Cornell Aeronautical Laboratory’s Research Trends magazine had devoted an entire issue to Rosenblatt (“because of the unusual significance of Dr. Rosenblatt’s article,” according to the editor). The article was titled “The Design of an Intelligent Automaton: Introducing the Perceptron—A Machine that Senses, Recognizes, Remembers, and Responds Like the Human Mind.” Rosenblatt would eventually rue choosing the term “perceptron” to describe his work. “It became one of Rosenblatt’s great regrets that he used a word that sounds like a machine,” Nagy told me. By “perceptron,” Rosenblatt really meant a class of models of the nervous system for perception and cognition.

His emphasis on the brain wasn’t a surprise. Rosenblatt had studied with James Gibson, one of the giants in the field of visual perception. He also looked up to McCulloch and Pitts and to Donald Hebb, a Canadian psychologist who in 1949 introduced a model for how biological neurons learn—to be clear, “learning” here refers to learning about patterns in data and not to the kind of learning we usually associate with high-level human cognition. “He’d always talk highly of them,” Nagy said.

While McCulloch and Pitts had developed models of the neuron, networks of these artificial neurons could not learn. In the context of biological neurons, Hebb had proposed a mechanism for learning that is often succinctly, but somewhat erroneously, put as “Neurons that fire together wire together.” More precisely, according to this way of thinking, our brains learn because connections between neurons strengthen when one neuron’s output is consistently involved in the firing of another, and they weaken when this is not so. The process is called Hebbian learning. It was Rosenblatt who took the work of these pioneers and synthesized it into a new idea: artificial neurons that reconfigure as they learn, embodying information in the strengths of their connections.

As a psychologist, Rosenblatt didn’t have access to the kind of computer power he needed to simulate his ideas in hardware or software. So, he borrowed time on the Cornell Aeronautical Laboratory’s IBM 704, a five-ton, room-size behemoth. The collaboration proved fruitful when Rosenblatt’s work caught the attention of physicists, resulting in papers in journals of psychology and of the American Physical Society. Rosenblatt eventually built the Mark I Perceptron. The device had a camera that produced a 20x20-pixel image. The Mark I, when shown these images, could recognize letters of the alphabet. But saying that the Mark I “recognized” characters is missing the point, Nagy said. After all, optical character recognition systems, which had the same abilities, were commercially available by the mid-1950s. “The point is that Mark I learned to recognize letters by being zapped when it made a mistake!” Nagy would say in his talks.

But what exactly is a perceptron, and how does it learn? In its simplest form, a perceptron is an augmented McCulloch-Pitts neuron imbued with a learning algorithm. What follows is an example with two inputs. Note that each input is being multiplied by its corresponding weight. (There is also an extra input, b, the reason for which will soon become clear.)

The computation carried out by the perceptron goes like this:

sum = w1x1 + w2x2 + b

If sum > 0: y = 1

Else: y = -1

More generally and in mathematical notation:

sum = w1x1 + w2x2 + ··· + wnxn + b
If sum > 0: y = +1
Else: y = -1

The main difference from the MCP model presented earlier is that the perceptron’s inputs don’t have to be binary (0 or 1), but can take on any value. Also, these inputs are multiplied by their corresponding weights, so we now have a weighted sum. Added to that is an additional term b, the bias. The output, y, is either -1 or +1 (instead of 0 or 1, as in the MCP neuron). Crucially, unlike with the MCP neuron, the perceptron can learn the correct value for the weights and the bias for solving some problem.

To understand how this works, consider a perceptron that seeks to classify someone as obese, y = +1, or not-obese, y = -1. The inputs are a person’s body weight, x1, and height, x2. Let’s say that the dataset contains a hundred entries, with each entry comprising a person’s body weight and height and a label saying whether a doctor thinks the person is obese according to guidelines set by the National Heart, Lung, and Blood Institute. A perceptron’s task is to learn the values for w1 and w2 and the value of the bias term b, such that it correctly classifies each person in the dataset as “obese” or “not-obese.” Note: We are analyzing a person’s body weight and height while also talking about the perceptron’s weights (w1 and w2); keep in mind these two different meanings of the word “weight” while reading further.

Once the perceptron has learned the correct values for w1 and w2 and the bias term, it’s ready to make predictions. Given another person’s body weight and height—this person was not in the original dataset, so it’s not a simple matter of consulting a table of entries—the perceptron can classify the person as obese or not-obese. Of course, a few assumptions underlie this model, many of them to do with probability distributions, which we’ll come to in subsequent chapters. But the perceptron makes one basic assumption: It assumes that there exists a clear, linear divide between the categories of people classified as obese and those classified as not-obese.

In the context of this simple example, if you were to plot the body weights and heights of people on an xy graph, with weights on the x-axis and heights on the y-axis, such that each person was a point on the graph, then the “clear divide” assumption states that there would exist a straight line separating the points representing the obese from the points representing the not-obese. If so, the dataset is said to be linearly separable.

Here’s a graphical look at what happens as the perceptron learns. We start with two sets of data points, one characterized by black circles (y = +1, obese) and another by black triangles (y = -1, not-obese). Each data point is characterized by a pair of values (x1, x2), where x1 is the body weight of the person in kilograms, plotted along the x-axis, and x2 is the height in centimeters, plotted along the y-axis.

The perceptron starts with its weights, w1 and w2, and the bias initialized to zero. The weights and bias represent a line in the xy plane. The perceptron then tries to find a separating line, defined by some set of values for its weights and bias, that attempts to classify the points. In the beginning, it classifies some points correctly and others incorrectly. Two of the incorrect attempts are shown as the gray dashed lines. In this case, you can see that in one attempt, all the points lie to one side of the dashed line, so the triangles are classified correctly, but the circles are not; and in another attempt, it gets the circles correct but some of the triangles wrong. The perceptron learns from its mistakes and adjusts its weights and bias. After numerous passes through the data, the perceptron eventually discovers at least one set of correct values of its weights and its bias term. It finds a line that delineates the clusters: The circles and the triangles lie on opposite sides. This is shown as a solid black line separating the coordinate space into two regions (one of which is shaded gray). The weights learned by the perceptron dictate the slope of the line; the bias determines the distance, or offset, of the line from the origin.
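
The learn-from-mistakes procedure just described can be sketched with the classic perceptron update rule: whenever a point is misclassified, nudge the weights and bias toward that point's label. To be clear, this is the standard rule, presented here ahead of the book's own derivation, and the body-weight/height numbers are invented for illustration:

```python
# A sketch of perceptron learning on a tiny, linearly separable toy dataset.
# Each entry is ((body weight in kg, height in cm), label), with +1 = "obese"
# and -1 = "not-obese". The numbers are made up for illustration.
data = [((95, 160), 1), ((110, 170), 1), ((60, 175), -1), ((55, 165), -1)]

w1, w2, b = 0.0, 0.0, 0.0          # weights and bias initialized to zero
for _ in range(100):               # repeated passes through the data
    for (x1, x2), label in data:
        y = 1 if w1 * x1 + w2 * x2 + b > 0 else -1
        if y != label:             # learn only from mistakes
            w1 += label * x1
            w2 += label * x2
            b += label

# After training, every point in the dataset is classified correctly.
for (x1, x2), label in data:
    y = 1 if w1 * x1 + w2 * x2 + b > 0 else -1
    print((x1, x2), label, y)
```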

Once the perceptron has learned the correlation between the physical characteristics of a person (body weight and height) and whether that person is obese (y = +1 or -1), you can give it the body weight and height of a person whose data weren’t used during training, and the perceptron can tell you whether that person should be classified as obese. Of course, now the perceptron is making its best prediction, having learned its weights and bias, but the prediction can be wrong. Can you figure out why? See if you can spot the problem just by looking at the graph. (Hint: How many different lines can you draw that succeed in separating the circles from the triangles?) As we’ll see, much of machine learning comes down to minimizing prediction error.

What’s described above is a single perceptron unit, or one artificial neuron. It seems simple, and you may wonder what all the fuss is about. Well, imagine if the number of inputs to the perceptron went beyond two: (x1, x2, x3, x4, and so on), with each input (xi) getting its own axis. You can no longer do simple mental arithmetic and solve the problem. A line is no longer sufficient to separate the two clusters, which now exist in much higher dimensions than just two. For example, when you have three inputs (x1, x2, x3), the data is three-dimensional: you need a 2D plane to separate the data points. In dimensions of four or more, you need a hyperplane (which we cannot visualize with our 3D minds). In general, this higher-dimensional equivalent of a 1D straight line or a 2D plane is called a hyperplane.

Now think back to 1958. Rosenblatt built his Mark I Perceptron with numerous such units. It could process a 20x20-pixel image—for a total of 400 pixels, with each pixel corresponding to an x input value. So, the Mark I took as input a long row of values: x1, x2, x3,…, x400. A complex arrangement of artificial neurons, both with fixed, random weights and weights that could be learned, turned this vector of 400 values into an output signal that could be used to discern the pattern in the image. (This is an oversimplified description. Some of the computing was complex enough that it needed an IBM 704. We’ll get a glimpse of the architectural details in chapter 10.) The Mark I could learn to categorize the letters of the alphabet encoded in those pixel values. All the logic just described, scaled up to handle 400 inputs, was built-in hardware. The machine, once it had learned (we’ll see how in the next chapter), contained knowledge in the strengths (weights) of its connections. It’s little wonder that everyone let their imagination run wild.

But if you closely examine what the perceptron learns, its limitations—in hindsight, of course—become obvious. The algorithm is helping the perceptron learn about correlations between values of (x1, x2,…, x400) and the corresponding value of y, if such correlations exist in the data. Sure, it learns the correlations without being explicitly told what they are, but these are correlations nonetheless. Is identifying correlations the same thing as thinking and reasoning? Surely, if the Mark I distinguished the letter “B” from the letter “G,” it was simply going by the patterns and did not attach any meaning to those letters that would engender further reasoning. Such questions are at the heart of the modern debate over the limits of deep neural networks, the astonishing descendants of perceptrons. There is a path connecting these early perceptrons to the technology of large language models or the AI being developed for, say, self-driving cars. That path is not a straight one; rather, it’s long and winding, with false turns and dead ends. But it’s a fascinating, intriguing path nonetheless, and we are setting off on it now.

Building the perceptron device was a major accomplishment. An even bigger achievement was the mathematical proof that a single layer of perceptrons will always find a linearly separating hyperplane, if the data are linearly separable. Understanding this proof will require us to get our first taste of vectors and how they form the backbone of methods used to represent data in machine learning. It’s our first mathematical pit stop.

CHAPTER 2

We Are All Just Numbers Here…

Less than a month before his death in September 1865, the Irish mathematician William Rowan Hamilton wrote a letter in four paragraphs to his son. In that letter, Hamilton recalled, among other things, a walk along the Royal Canal in Dublin, Ireland. It was October 16, 1843. Hamilton was on his way to attend a meeting of the Royal Irish Academy. His wife was with him. When the couple came underneath the Brougham Bridge, Hamilton, who had been struggling for more than a decade with some deep mathematical questions, had a flash of inspiration. “An electric circuit seemed to close; and a spark flashed forth…I [could not] resist the impulse—unphilosophical as it may have been—to cut with a knife on a stone of Brougham Bridge, as we passed it, the fundamental formula with the symbols, i, j, k; namely, i² = j² = k² = ijk = -1.”

Hamilton signed off the letter to his son with these words: “With this quaternion of paragraphs [emphasis mine] I close this letter…Your affectionate father, William Rowan Hamilton.” The use of the word “quaternion” was deliberate. A quaternion is a mathematical entity composed of four elements with very strange and special properties, which Hamilton discovered on that fateful day beneath Brougham Bridge. The equation he etched on the stone there, representing the general form of the quaternion, is one of the most famous examples of mathematical graffiti; the original, which has long since been defaced, was replaced by an official plaque reading:

Here as he walked by

on the 16th of October 1843

Sir William Rowan Hamilton

in a flash of genius discovered

the fundamental formula for

quaternion multiplication

i² = j² = k² = ijk = -1

& cut it on a stone of this bridge.

Quaternions are exotic entities, and they don’t concern us. But to create the algebra for manipulating quaternions, Hamilton developed some other mathematical ideas that have become central to machine learning. In particular, he introduced the terms “scalar” and “vector.” These days, most of us would likely not have heard of Hamilton, but we are intuitively familiar with the notion of scalars and vectors, even if not their formal definitions. Here’s a quick primer.

Consider a man who walks five miles. Given that statement, the only thing we can say about what the man did is denoted by a single number: the distance walked. This is a scalar quantity, a stand-alone number. Now, if we were told that the man walked five miles in a northeasterly direction, we would have two pieces of information: the distance and the direction. This can be represented by a vector. A vector, then, has both a length (magnitude) and a direction. In the following graph, the vector is an arrow of magnitude 5.

If you closely examine the vector, you’ll see that it has two components: one along the x-axis and another along the y-axis. It’s equivalent to saying that the man went four miles in the direction due east and three miles in the direction due north. The vector representing the actual walk is an arrow going from (0, 0) to (4, 3), giving both the direction and the distance. The magnitude of the vector is simply the length of the hypotenuse of the right-angled triangle formed by the vector and its components along the x- and y-axes. So, the vector’s magnitude, or length, is equal to √(4² + 3²) = √25 = 5.

Thinking in terms of vectors, without using formal ways of representing and manipulating them, predates Hamilton. For example, by the late 1600s, Isaac Newton was already using geometric ways of thinking about vector-like entities such as acceleration and force. Newton’s Second Law of Motion says that the acceleration experienced by an object is proportional to the force acting upon it and that the object’s acceleration and the force have the same direction. The first corollary to Newton’s Laws of Motion, in his Principia, states, “A body by two forces conjoined will describe the diagonal of a parallelogram, in the same time that it would describe the sides, by those forces apart.” This is a statement about using geometry to add two vectors, even though Newton didn’t call the quantities vectors.

To understand vector addition, we can go back to our man who walked five miles, represented by a vector going from (0, 0) to (4, 3). After reaching the destination, the man turns more northward such that in the coordinate plane, he reaches (6, 9): He has effectively walked two more miles in the direction due east and six more miles in the direction due north. This is represented by a second vector, an arrow drawn from (4, 3) to (6, 9). This new vector has an x component of 2 and a y component of 6. What is the total distance the man walked? And what is the net distance in the xy coordinate space, from origin to the final destination? This graph shows you the answers to both:

The magnitude of the two individual vectors, or walks, is √(4² + 3²) = 5 and √(2² + 6²) = √40 ≈ 6.32, respectively. So, the total distance the man walks is 5 + 6.32 = 11.32 miles.

The resultant vector is an arrow drawn from the origin to the final destination, which is (6, 9), and its magnitude is √(6² + 9²) = √117 ≈ 10.82. The net distance in the xy coordinate space is 10.82 miles.
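The arithmetic of the two walks can be checked directly. Here is a minimal Python sketch using the coordinates from the example; nothing beyond the standard library is needed:

```python
import math

# First walk: (0, 0) -> (4, 3); second walk: (4, 3) -> (6, 9).
walk1 = (4, 3)   # x and y components of the first vector
walk2 = (2, 6)   # x and y components of the second vector

# A vector's magnitude is the hypotenuse of its component triangle.
len1 = math.hypot(*walk1)   # sqrt(4^2 + 3^2) = 5.0
len2 = math.hypot(*walk2)   # sqrt(2^2 + 6^2) ~ 6.32

# Vector addition: add the components to get the resultant vector.
resultant = (walk1[0] + walk2[0], walk1[1] + walk2[1])
net = math.hypot(*resultant)

print(round(len1 + len2, 2))  # 11.32 miles walked in total
print(round(net, 2))          # 10.82 miles from origin to destination
```

The total distance walked (11.32 miles) and the net displacement (10.82 miles) differ precisely because the two walks were not in the same direction.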

This now helps us make sense of what Newton was saying. Let’s say the acceleration caused by one force acting upon an object is given by the vector (2, 6) and that the acceleration caused by another force on the same object is given by the vector (4, 3). Both forces are acting on the object at the same time. What is the total acceleration of the object? According to Newton’s corollary, the geometric interpretation involves drawing a parallelogram, as shown in the following figure; the net acceleration, then, is given by the diagonal vector (6, 9):

If the acceleration is in units of meters per second per second (m/s²), then the net acceleration is given by the magnitude of the vector (6, 9), which equals 10.82 m/s², in the direction of the arrow.

I have chosen to add the same vectors in this case as in the example of the man walking, but here the two vectors represent acceleration, not distance, and they both have their tails at (0, 0). What this tells you is that the vector (2, 6) is the same vector regardless of whether its tail is at (0, 0) or at (4, 3), as in the previous example. An important property of vectors is that we can move the arrow representing a vector in the coordinate space, and if we don’t change the length of the arrow and its orientation, it’s the same vector. Why? Well, because we haven’t changed its length or its direction, the two properties that define the vector.

None of this was formally understood as the beginnings of vector analysis when Newton published his Principia in 1687. His contemporary Gottfried Wilhelm Leibniz (1646–1716), however, had more than an inkling about this new way of thinking. In 1679, in a letter to another luminous contemporary, Christiaan Huygens, Leibniz wrote, “I believe that I have found the way…that we can represent figures and even machines and movements by characters, as algebra represents numbers or magnitudes.” Leibniz never quite formalized his intuition, but his prescience—as we’ll see when we understand the importance of vectors for machine learning—was astounding. Following Leibniz, a host of other mathematicians, including Johann Carl Friedrich Gauss (1777–1855), developed methods for the geometric representation of certain types of numbers in two dimensions, setting the stage for Hamilton’s discovery of quaternions and the formalization of vector analysis.

VECTORS BY THE NUMBERS

Vector analysis doesn’t have to be geometric. It can come down to manipulating numbers written in a certain format. And in fact, for machine learning, that’s how we need to think about vectors. For example, the accelerations caused by the two forces in the previous example are simply arrays of two numbers each, [4, 3] and [2, 6], respectively. Adding them is the same as adding the individual components of each vector (stacked vertically, as a column). You don’t have to fuss with arrows:

[4]   [2]   [6]
[3] + [6] = [9]

Subtracting vectors is similar:

[4]   [2]   [ 2]
[3] - [6] = [-3]

What just happened? Why is the y component of the resultant vector negative? If these numbers still represent accelerations, then subtraction means that the second force acts against the first. Along the x-axis, the acceleration is a little smaller than when we added the two vectors, but it is still positive; along the y-axis, however, the force now acts against the initial direction of motion, resulting in a deceleration.

One can multiply a vector by a scalar—simply multiply each element of the vector by the scalar:

    [4]   [20]
5 × [3] = [15]

Geometrically, that’s the same as stretching the arrow (or vector) five times in the same direction. The magnitude of the original vector is 5. Scaling it 5 times gives us a new magnitude of 25. If you were to calculate the magnitude of the new vector using its scaled-up coordinates, you’d again get 25: √(20² + 15²) = √625 = 25.
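The same scaling can be checked numerically. A quick Python sketch, using the vector (4, 3) from the text:

```python
import math

v = (4, 3)                        # original vector, magnitude 5
scaled = tuple(5 * c for c in v)  # multiply each component by the scalar 5

print(scaled)                # (20, 15)
print(math.hypot(*scaled))   # 25.0 -- five times the original magnitude
```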

There’s yet another way to represent vectors. Restricting ourselves to two dimensions, think of a vector of length one, i, along the x-axis and a vector of length one, j, along the y-axis. Note that i and j are in lowercase and boldface; this signifies that they are vectors. So, i can be thought of as an arrow that points from (0, 0) to (1, 0) and j as an arrow that points from (0, 0) to (0, 1). Each has a magnitude of 1 and is also called a unit vector. Given this, the vectors (4, 3) and (2, 6), in Cartesian coordinates, can be written as 4i + 3j and 2i + 6j, respectively. That’s the same as saying that the vector (4, 3) is 4 units along the x-axis and 3 units along the y-axis and that the vector (2, 6) is 2 units along the x-axis and 6 units along the y-axis. The use of i and j is shorthand for representing vectors. It’s also important to point out that a unit vector is simply a vector with a magnitude of 1; it doesn’t have to lie along the perpendicular axes of some coordinate space.

These ideas apply to higher dimensions, too, and we’ll come to that. For now, getting a handle on the mathematical manipulation of 2D vectors and their corresponding geometric meanings will go a long way toward helping us understand the role of their higher-dimensional counterparts in machine learning.

THE DOT PRODUCT

Another important operation with vectors is something called the dot product. Consider vector (4, 0), call it a, and vector (5, 5), call it b. (Again, the boldface and lowercase for letters a and b signify that they are vectors.) Conceptually, the dot product a.b (read that as “a dot b”) is defined as the magnitude of a multiplied by the projection of b onto a, where the projection can be thought of as the “shadow cast” by one vector onto another.

The magnitude of a is denoted by ||a||. The projection of b onto a is given by the magnitude of b, or ||b||, multiplied by the cosine of the angle between the two vectors. For the vectors we have chosen, the angle between them is 45 degrees (or π/4 radians), as shown in the preceding graph. So:

a.b = ||a|| × ||b|| × cos(45°) = 4 × 5√2 × (1/√2) ⇒ a.b = 20

Note: the symbol ⇒ means “which implies that.”

Now let’s make a couple of small tweaks. Let the vector a be given by (1, 0), vector b by (3, 3). Vector a has a magnitude of 1, so it’s a “unit vector.” Now, if you were to take the dot product a.b, you’d get:

a.b = ||a|| × ||b|| × cos(45°) = 1 × 3√2 × (1/√2) = 3

The dot product turns out to be equal to the x component of vector b, or the shadow cast by b onto the x-axis, the direction of the unit vector. This gives us a crucial geometric intuition: If one of the vectors involved in a dot product is of length 1, then the dot product equals the projection of the other vector onto the vector of unit length. In our special case, the unit vector lies along the x-axis, so the projection of vector b onto the x-axis is simply its x component, 3.

But here’s something amazing about dot products. Even if the unit vector is not along one of the axes, this geometric truth still holds. Let’s say a is (1/√2, 1/√2). Its magnitude is 1, so it’s a unit vector, but it’s at a 45-degree angle to the x-axis. Let’s say b is the vector (1, 3). The dot product a.b is ||a|| × ||b|| × cos of the angle between them, which equals 2√2 ≈ 2.83, which in turn is the projection of the vector b onto the straight line that extends along vector a (see figure below).
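This projection is easy to verify numerically, computing the dot product as a sum of componentwise products (the formula derived a few paragraphs below). A small Python check, with the 45-degree unit vector and b = (1, 3) from the text:

```python
import math

# Unit vector along the 45-degree line, and the vector b from the text.
u = (1 / math.sqrt(2), 1 / math.sqrt(2))
b = (1, 3)

# Dot product via components; because u has length 1, this equals the
# projection of b onto the line through u.
projection = u[0] * b[0] + u[1] * b[1]
print(round(projection, 5))   # 2.82843, i.e., 2*sqrt(2)
```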

Another important thing the dot product tells us about two vectors is whether they are at right angles, or orthogonal, to each other. If they are at right angles, then the cosine of the angle between them (90°) equals zero. So, regardless of the lengths of the vectors, their dot product, or the projection of vector b onto vector a, is always zero. Conversely, if the dot product of two vectors is zero, they are orthogonal to each other.

How would we calculate the dot product if we were to use the other method for representing vectors, using their components, and we didn’t know the angle between the two vectors?

Say, a = a1i + a2j and b = b1i + b2j. Then:

a.b = (a1i + a2j).(b1i + b2j) = a1b1 × i.i + a1b2 × i.j + a2b1 × j.i + a2b2 × j.j

Note that the second and third terms in the equation turn out to be zero. The vectors i and j are orthogonal, so i.j and j.i are zero. Also, both i.i and j.j equal 1. All we are left with is a scalar quantity:

a.b = a1b1 + a2b2
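The component formula can be checked against the geometric, cosine-based definition. A short Python sketch, using the vectors (4, 0) and (5, 5) from earlier and the 45-degree angle between them:

```python
import math

def dot(a, b):
    # Component form: a.b = a1*b1 + a2*b2
    return a[0] * b[0] + a[1] * b[1]

a, b = (4, 0), (5, 5)

# Geometric form: |a| * |b| * cos(angle); here the angle is 45 degrees.
geometric = math.hypot(*a) * math.hypot(*b) * math.cos(math.radians(45))

print(dot(a, b))             # 20
print(round(geometric, 10))  # 20.0 -- the two forms agree

# Orthogonal vectors have a zero dot product, whatever their lengths.
print(dot((1, 0), (0, 7)))   # 0
```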

MACHINES AND VECTORS

If all this feels far removed from machine learning, perceptrons, and deep neural networks, rest assured it’s not. It’s central to the plot. And we are getting there, by leaps and bounds, yet by stepping carefully only on the stones necessary for sure footing.

It’s time to revisit the perceptron and think of it in terms of vectors. The intent is to gain geometric insights into how data points and the weights of a perceptron can be represented as vectors and how to visualize what happens when a perceptron tries to find a linearly separating hyperplane that divides the data points into two clusters. Much of it has to do with using dot products of vectors to find the relative distances of the data points from the hyperplane, as we’ll see.

Recall the generic equation for a perceptron, which says that the perceptron outputs 1 if the weighted sum of its inputs plus some bias term, b, is greater than 0; otherwise, it outputs -1.

We have made a subtle change to the notation we used previously: The argument to the function g is now a vector; in the previous chapter, because we hadn’t yet introduced the notion of vectors, the argument was simply a plain x rather than the boldface vector x. Let’s stick to the two-dimensional case, with data points given by different values for (x1, x2) and the weights of the perceptron given by (w1, w2). The perceptron first computes the weighted sum of the inputs:

w1x1 + w2x2

If this weighted sum is greater than some threshold, call it -b, then the perceptron’s output, y, is 1. Else it is -1. So:

y = 1, if w1x1 + w2x2 > -b
y = -1, otherwise

This can be rewritten as:

y = 1, if w1x1 + w2x2 + b > 0
y = -1, otherwise

Let’s put on our vectors hat. The set of weights (w1, w2) is nothing but a vector w. But what exactly does w represent?

The figure above shows a weight vector, w = (2.5, 1.5). It also shows a unit vector in the same direction, u. The dashed line gives us the direction along which the two vectors lie. Let’s draw a solid black line perpendicular, or orthogonal, to vectors w and u. This line separates the shaded area from the rest of the coordinate space. So, if we were trying to find a line that clearly delineated the xy plane into two regions, shaded and unshaded, all we would need to specify such a boundary would be the vector w, or the corresponding unit vector, u. If the solid line, or boundary, in the preceding figure is the separating hyperplane, then the vector w is orthogonal to it and characterizes that hyperplane. The boundary is a line when we are dividing 2D space, a plane when partitioning a 3D volume, and a hyperplane in higher dimensions.

Our earlier look at the perceptron learning algorithm showed that it tries to find a hyperplane that divides the coordinate space into two. So, the perceptron finds, or learns, an appropriate set of weights. These weights constitute a vector, w, which is orthogonal to the hyperplane. As you change the weights of the perceptron, you change the direction of w, and so you change the orientation of the hyperplane, which is always perpendicular to w. And what’s true of w is also true of the unit vector u that lies in the same direction. So, one way of rephrasing what the perceptron does is to say that it finds the vector w, which is the same as saying that it finds the corresponding perpendicular hyperplane.

Now consider the data points that either lie or don’t lie in the shaded area. Each data point is given by (x1, x2) and can be thought of as a vector, too. Then the weighted sum (w1x1 + w2x2) is the same as the dot product of the vector representing the data point with the weight vector. Note that if the data point lies on the hyperplane, which in the 2D case is just a line, then the vector (x1, x2) will be orthogonal to w, making the dot product equal zero. Below is a graphical look at the dot product of data points and the weight vector. For convenience, we’ll work with a weight vector of unit length. It doesn’t change anything conceptually, but it does simplify the math. Let’s start with vector a, given by data point (3, 1).

Because w is a unit vector, its dot product with a equals the projection of a onto the dashed line. The point at which a lands on the line perpendicular to the hyperplane is a measure of the distance of the point (3, 1) from the hyperplane.

Next, let’s look at this somewhat busy but important depiction of the dot product of the weight vector with four different data points, or vectors, a (3, 1), b (2, -1), c (-2, 1), and d (-1, -3).

It’s clear that the dot product of each vector with w is telling you something about that vector: its distance from the hyperplane and whether it’s to one side of it (dot product is +ve) or to the other (-ve). In this scenario, points a and b are linearly separated from points c and d. (Points lying in the gray shaded area represent y = 1; points lying in the unshaded area and points lying on the dividing line itself represent y = -1.)

So, let’s say the perceptron, at first attempt, finds the weights and the hyperplane as depicted above. But let’s also say that according to our labeled training data, points a, b, and c should have been on one side of the hyperplane and only point d on the other. For argument’s sake, let’s say that a, b, and c represent people classified as those who like thriller movies; d represents a person who doesn’t. This means that the perceptron hasn’t yet found the correct hyperplane. One thriller lover has been classified as a thriller hater. That’s where the bias term comes in. Adding a bias term to the equation is the same as moving the hyperplane away from the origin, but without changing its orientation. For example, after iterating through the training data, the perceptron could have found this hyperplane:

It’s clear by looking at the figure above that if the data are linearly separable into two clusters, then there exist many, many separating hyperplanes (for different values of the bias term and different orientations of w). Just think of how many straight lines you can draw that go through the space between c and d: in principle, infinitely many. The perceptron guarantees only that it’ll find one, and not necessarily the best one. We’ll come to what “best” means in more detail, but it has to do with prediction. After all, the perceptron is learning the weights and the bias term in order to classify some as-yet-unseen data point with respect to the hyperplane. For example, given two characteristics of some person that we are using to classify that person as a thriller lover or a thriller hater, on which side of the hyperplane would the person have to be to be classified as one or the other? A good or best possible hyperplane will minimize future prediction errors. (Defining a “future” prediction error, let alone minimizing it, is a nontrivial, or not an easy, problem.)

These graphs are a way of developing an intuitive sense of what was happening when a perceptron learned. If you were to try to write a computer program to simulate a perceptron, you wouldn’t be drawing charts and graphs. You’d be manipulating numbers. Fortunately, the numerical representations we have seen of vectors so far are already enough to showcase the power of these abstractions. In our 2D example, data points (x1, x2) are just arrays of numbers, each array with two elements. The weight vector is similarly another array of two numbers. Finding the dot product is a matter of manipulating these arrays.

More generically, these arrays are called matrices, which contain rows and columns of numbers. For example, if there are m rows and n columns, then we have what’s called an m×n matrix (read as an “m by n matrix”). A vector is a particular form of matrix with either one row or one column: either m=1 or n=1. We saw these earlier, only the term “matrix” hadn’t yet been introduced. But that’s what vectors are: matrices with just one column or one row. Here’s an example of adding two one-column matrices to get a third one-column matrix:

[4]   [2]   [6]
[3] + [6] = [9]

Flip one of the one-column matrices on its side, and you get a matrix with a single row:

[4  3]

So, in formal notation, a one-column matrix with two elements is given by:

[a11]
[a21]

The notation says that the column matrix has two rows (indexed by numbers 1 and 2) and that each row has just one element (index 1). When you flip the matrix on its side, the numbering changes, and it becomes [a11  a12]. (Note that the row index is 1, while the columns have indices 1 and 2.)

This is called taking the “transpose” of a matrix. (It looks trivial, or easy, for a column matrix, but it does get more involved for higher-order matrices, which we’ll come to in later chapters.) Taking the transpose is a key aspect of calculating the dot product between two column matrices. We’ll use boldface capital letters to signify matrices. Let A and B be two-element column matrices. You cannot take a dot product of two column matrices directly. That’s because to take the dot product, the number of columns in the first matrix must equal the number of rows in the second one. So, in our case, one of them must be transposed. The transpose of A is written as Aᵀ. The dot product A.B is written as AᵀB or BᵀA. (They are one and the same thing in this case.)
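With a numerical library, the transpose-and-multiply recipe is a few lines. Here is a sketch using NumPy, with the column vectors [4, 3] and [2, 6] from the earlier examples:

```python
import numpy as np

# Two column matrices (2x1). An (m x n) @ (n x p) product needs the inner
# dimensions to match, so one matrix must be transposed before multiplying.
A = np.array([[4], [3]])
B = np.array([[2], [6]])

dot_product = A.T @ B        # (1x2) @ (2x1) -> a 1x1 matrix
print(dot_product.item())    # 26, i.e., 4*2 + 3*6
print((B.T @ A).item())      # 26 -- same scalar either way
```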

请注意,如果您以单位向量ij的形式写出向量,则这正是您得到的值。如果a = a 11 i + a 12 jb = b 11 i + b 21 j,则:

Note that this is exactly the value you’d get if you wrote the vectors in terms of their unit vectors i and j. If a = a11i + a12j and b = b11i + b21j, then:

a.b = (a11i + a12j).(b11i + b21j)
a.b = a11b11 × i.i + a11b21 × i.j + a12b11 × j.i + a12b21 × j.j
a.b = a11b11 + a12b21

Here’s another cool thing about using matrices, rather than arrows, to represent vectors: You can just manipulate the numbers and get a scalar value for the dot product without worrying about the cosine of the angle between them. What this means is that if you have a bunch of data points, each represented by a vector, and you want to find their relative distances from a hyperplane characterized by a weight vector w, all you do is take the dot products of each data point with w, and you’ll have the necessary information.

And if one of the data points is on the hyperplane, its dot product with the weight vector would be zero, signifying that the data point is orthogonal to the weight vector and that its distance from the hyperplane is zero.

PUTTING IT TOGETHER

All this is leading up to a rather elegant shorthand notation for the perceptron.

Consider inputs [x1, x2, …, xn]. You can write this as a column vector x. Similarly, the weights, using one weight for each input, [w1, w2, …, wn], are the column vector w. Note that we have made another subtle shift in notation: We use square brackets, [], to contain the elements of w and x, instead of parentheses, (), to signal that w and x are matrices or vectors.

We know that the output of a perceptron involves calculating the weighted sum w1x1 + w2x2 + ··· + wnxn. This is more concisely written as the dot product of w and x, or wᵀx. Given this, here’s what a perceptron does:

y = 1, if wᵀx + b > 0
y = -1, otherwise

Pictorially, let’s look again at the perceptron with two inputs and two weights:

The bias term looks incongruous. There’s a neat little trick to subsume it into the weight vector (see first figure below).

In this depiction, the bias term b is equal to the weight w0 and is multiplied by x0. However, x0 is always set to 1, ensuring that the bias b is always added to the weighted sum of the other inputs. The weight vector, w, is now given by [w0, w1, w2]. The input vector, x, equals [x0, x1, x2], where x0=1.
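The bias-folding trick is easy to sanity-check in code. In this sketch the weights, bias, and data point are illustrative values, not taken from the text:

```python
# Folding the bias into the weight vector: prepend a constant input x0 = 1
# and treat the bias b as the extra weight w0.
b = 0.5
w = [2.5, 1.5]   # original weights (w1, w2); illustrative values
x = [3.0, 1.0]   # a data point (x1, x2); illustrative values

# Weighted sum computed the long way: w1*x1 + w2*x2 + b
long_way = w[0] * x[0] + w[1] * x[1] + b

# The trick: augment both vectors, then a single dot product does it all.
w_aug = [b] + w        # [w0, w1, w2], with w0 = b
x_aug = [1.0] + x      # [x0, x1, x2], with x0 = 1
short_way = sum(wi * xi for wi, xi in zip(w_aug, x_aug))

print(long_way == short_way)  # True
```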

The generic perceptron, for input vector x = [x0, x1, x2, …, xn] and weight vector w = [w0, w1, w2, …, wn], looks like the figure directly above.

The equation of the perceptron looks even simpler:

y = 1, if wᵀx > 0
y = -1, if wᵀx ≤ 0

Let’s burn this equation into our mind’s eye. It’s a simple, eloquent statement of these facts: The weight vector w is perpendicular to the line, or hyperplane, that separates the data points into two clusters. For one cluster of points, wᵀx is less than or equal to zero, and the output of the perceptron is -1. For the other cluster of points, wᵀx is greater than zero, and the output of the perceptron is 1. The points that lie on the hyperplane (given by wᵀx = 0) are assigned to the cluster with the label y = -1. From a machine learning perspective, the task of a perceptron is to learn the weight vector, given a set of input data vectors, such that the weight vector represents a hyperplane that separates the data into two clusters. Once it has learned the weight vector, and then is given a new data point to classify (say, as “obese” or “not-obese”), the perceptron simply has to calculate wᵀx for the new instance of data, see if it falls on one side or the other of the hyperplane, and then classify it accordingly.
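Here is a minimal sketch of that decision rule in Python. The weight vector is a hypothetical one (bias folded in as w0, with x0 = 1), chosen so that the points a = (3, 1) and d = (-1, -3) from the earlier figures land on opposite sides of the hyperplane:

```python
import numpy as np

def perceptron_output(w, x):
    # y = 1 if w.x > 0, else -1 (points on the hyperplane get the -1 label)
    return 1 if w @ x > 0 else -1

# Hypothetical learned weights [w0, w1, w2]; w0 plays the role of the bias.
w = np.array([-1.0, 1.0, 1.0])

# Each data point is augmented with x0 = 1 before classification.
print(perceptron_output(w, np.array([1.0, 3.0, 1.0])))    # 1: one side
print(perceptron_output(w, np.array([1.0, -1.0, -3.0])))  # -1: other side
```

Note that this sketch only applies an already chosen weight vector; the learning rule that finds such weights is taken up later in the chapter.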

从罗森布拉特的想法到将输入线性转换为输出的正式符号,这经历了一段相当漫长的旅程,但这种表述的重要性怎么强调都不为过。这是我们最终进军其他机器学习技术(包括现代深度神经网络)的基石之一。

This has been a somewhat extended journey from Rosenblatt’s ideas to a formal notation for a linear transformation of an input to an output, but it’s hard to overstate the importance of this formulation. It’s one of the cornerstones of our eventual forays into other ML techniques, including modern deep neural networks.

保证成功

GUARANTEED TO SUCCEED

罗森布拉特发明感知器学习算法后不久——我们稍后会谈到它的确切公式——研究人员(包括罗森布拉特)开始分析它,开发定理和证明来表明它确实是一种计算上可行的算法。这些证明表明,如果存在解决方案,感知器就会收敛到一个解决方案,“解决方案”被定义为将数据线性分为两组的超平面。乔治·纳吉记得当时的情况。“罗森布拉特自己收集了这些,”纳吉告诉我。“他收集了一批……20 世纪 60 年代发表的证明。”康奈尔大学的应用数学家亨利·戴维·布洛克 (Henry David Block) 于 1962 年开发了其中一种证明,他与罗森布拉特合作对感知器进行了数学分析。布洛克的证明很复杂,但它为感知器学习算法在尝试寻找线性分离超平面时所犯的错误数量设定了上限。布洛克是一位成就卓著的理论家,擅长推理机器和“可能性的逻辑。”1978 年他去世时,康奈尔大学的教职员工在悼念声明中说道,“尽管大卫·布洛克才智过人、成就卓著,但他却非常谦虚,事实上,他为人谦逊,对一切都很宽容,除了自鸣得意。”

Soon after Rosenblatt invented the perceptron learning algorithm—we’ll come to its exact formulation in a bit—researchers, Rosenblatt included, began analyzing it, developing theorems and proofs to show that it was indeed a computationally viable algorithm. Among these proofs were those that showed that perceptrons would converge upon a solution if one existed, “solution” being defined as a hyperplane that linearly separated the data into two groups. George Nagy remembered the time. “Rosenblatt himself collected these,” Nagy told me. “He had a collection of…proofs that had been published in the 1960s.” One such proof was developed in 1962 by Henry David Block, an applied mathematician at Cornell University who collaborated with Rosenblatt on the mathematical analysis of perceptrons. Block’s proof was complicated, but it established upper bounds for the number of mistakes made by the perceptron learning algorithm as it tried to find a linearly separating hyperplane. Block was an accomplished theoretician, at home with reasoning about machines and “the logic of what’s possible.” When he died in 1978, the Cornell faculty said in their memorial statement, “For all his exceptional intelligence and accomplishments, David Block was a deeply modest, indeed, humble person, tolerant of everything save smugness.”

Block 对自鸣得意的不容忍,体现在他为《感知器:计算几何学导论》撰写的经典二十二页书评中。该书是麻省理工学院科学家和人工智能先驱 Marvin Minsky 和 Seymour Papert 合著的一本三百页的著作。《感知器》是一部阐述、定理和证明的杰作,在 1969 年出版后引起了巨大轰动。“我们将详细研究一类通过权衡证据来做出决策的计算,”明斯基和帕普特在引言中写道。“我们将要研究的机器是一类有着各种名称的设备的抽象版本;我们同意使用‘感知器’这个名字,以纪念 Frank Rosenblatt 的开创性工作。”Block 在书评的开头就对这本书大加赞赏:“这是一本了不起的书。作者不仅制定了一个新的基本概念框架,而且还使用惊人巧妙的数学技巧填补了细节。”这些巧妙的数学技巧之一是明斯基和帕普特版本的收敛证明,但他们随附的注释似乎激怒了布洛克。两人提请读者注意以色列数学家 Shmuel Agmon 于 1954 年发表的一篇论文,后者似乎预见了收敛证明。“从抽象的数学意义上讲,定理和证明在感知器出现之前就已经存在了,”明斯基和帕普特写道。“很明显,如果对感知器感兴趣的控制论专家知道阿格蒙的工作,这个定理就会立刻显而易见。”

Block’s intolerance of smugness comes through in his classic twenty-two-page review of Perceptrons: An Introduction to Computational Geometry, a three-hundred-page book by MIT scientists and AI pioneers Marvin Minsky and Seymour Papert. A tour de force of exposition, theorems, and proofs, Perceptrons made a huge splash upon its publication in 1969. “We will study in great detail a class of computations that make decisions by weighing evidence,” Minsky and Papert write in its introduction. “The machines we will study are abstract versions of a class of devices known under various names; we have agreed to use the name ‘perceptron’ in recognition of the pioneer work of Frank Rosenblatt.” Block praises the book early on in his review: “It is a remarkable book. Not only do the authors formulate a new and fundamental conceptual framework, but they also fill in the details using strikingly ingenious mathematical techniques.” One of these ingenious mathematical techniques was Minsky and Papert’s version of the convergence proof, but their accompanying notes seemed to irritate Block. The duo had drawn attention to a 1954 paper by Israeli mathematician Shmuel Agmon, who had seemingly anticipated the convergence proof. “In an abstract mathematical sense, both theorem and proof already existed before the perceptron,” Minsky and Papert write. “It is quite clear that the theorem would have been instantly obvious had the cyberneticists interested in perceptrons known about Agmon’s work.”

对控制论专家的挖苦让布洛克很不舒服。“控制论”是美国数学家诺伯特·维纳在 1948 年出版的同名书中创造的一个术语,指的是对“动物和机器中的控制与交流”的研究。因此,那些把感知器作为了解人类大脑和神经系统的一种手段来研究的人就是控制论专家。他们是否应该知道感知器收敛证明(它表明算法将在有限步骤后找到答案)的先例?“由于‘阿格蒙的工作’中没有关于有限步骤后过程终止的内容,因此该定理的这一方面至少似乎不是‘一目了然的’,”布洛克在他的评论中打趣道。“此外,尚不清楚谁是‘控制论专家’;但作者大概不把自己归入这一类别。人们可能想知道,为什么这种谴责不适用于所有对感知器感兴趣的人。”布洛克随后引用了明斯基和帕普特各自于 1961 年发表的感知器相关主题的论文,暗示明斯基和帕普特的谴责应该同样适用于他们自己。布洛克直言不讳:“总而言之,明斯基和帕普特对感知器理论的阐述是精确而优雅的。他们的数学分析非常出色。他们的阐述生动活泼,常常夸夸其谈,有时还带有讥讽。”

The dig at cyberneticists rubbed Block the wrong way. “Cybernetics,” a term coined by American mathematician Norbert Wiener in his 1948 book of the same name, refers to the study of “control and communication in the animal and the machine.” So, those doing research on perceptrons as a means of understanding the human brain and nervous system were cyberneticists. Should they have known of precursors to the convergence proof for perceptrons, which shows that the algorithm will find an answer after a finite number of steps? “Since there is nothing in ‘Agmon’s work’…about the termination of the process after a finite number of steps, this aspect of the theorem at least does not seem to be ‘instantly obvious,’ ” Block quipped in his review. “Furthermore, it is not clear who ‘the cyberneticists’ are; but presumably the authors do not include themselves in this category. One might wonder why the rebuke does not apply to all those interested in the perceptron.” Block followed up with references to 1961 papers by both Minsky and Papert on topics related to perceptrons, implying that Minsky and Papert’s castigation should apply equally to them. Block called it as he saw it: “In sum then, Minsky and Papert’s formulation of their theory of perceptrons is precise and elegant. Their mathematical analysis is brilliant. Their exposition is lively, often bombastic, and, occasionally, snide.”

抛开他们的夸夸其谈和尖刻言论,我们将重点关注 Minsky 和 Papert 收敛证明的精确与优雅。但首先,我们需要重新审视 Rosenblatt 的算法,并掌握更正式的符号。

Putting aside their bombast and snide remarks, we’ll focus on the precision and elegance of Minsky and Papert’s convergence proof. But first, we need to revisit Rosenblatt’s algorithm with more formal notations in hand.

让我们来考虑一个可能存在的实际问题。假设我们已经从冠状病毒大流行的灾难性经历中吸取了教训,那么让我们希望我们能够在下一次涉及新型传染性呼吸道病原体的大流行的头几个月里采取一些明智的应对措施。(但愿它不会很快发生。)在这个更开明的场景中,世界各地的医院都在努力收集大流行早期所见患者的数据。每个患者使用六个变量进行分类:x1 = 年龄,x2 = 体重指数,x3 = 呼吸困难(是 = 1/否 = 0),x4 = 发烧(是/否),x5 = 患有糖尿病(是/否),x6 = 胸部 CT 扫描(0 = 正常,1 = 轻度感染,2 = 重度感染)。这些变量的值构成一个六维向量。每个患者都是指向 6D 空间的一个箭头,或者简单地说是 6D 空间中的一个点。

Let’s take a potentially real problem. Assuming we have learned from our disastrous experience with the coronavirus pandemic, let’s hope we can bring some smarts to how we react during the first months of the next pandemic involving a new infectious respiratory pathogen. (Fingers crossed it doesn’t happen anytime soon.) In this more enlightened scenario, hospitals worldwide diligently collect data about the patients they see early on in the pandemic. Each patient is categorized using six variables: x1 = age, x2 = body mass index, x3 = has difficulty breathing (yes = 1/no = 0), x4 = has fever (yes/no), x5 = has diabetes (yes/no), x6 = chest CT scan (0 = clear, 1 = mild infection, 2 = severe infection). The values for these variables make up a six-dimensional vector. Each patient is an arrow pointing in 6D space, or simply a point in 6D space.

因此,对于第 i位患者,向量xi由 6 个属性 [ x1, x2, x3, x4, x5, x6 ]给出。

So, for the ith patient, the vector xi is given by 6 attributes [x1, x2, x3, x4, x5, x6].

医生会注意到,患者要么在入院后三天内恢复正常并被送回家,要么病情恶化并需要呼吸机支持。因此,每个患者都有一个相关结果y = -1(三天后不需要呼吸机支持)或y = 1(三天后需要呼吸机支持)。

Doctors notice that patients are either okay about three days after coming to the hospital and are sent home or they worsen and need ventilator support. So, each patient has an associated outcome y = -1 (did not need ventilator support after three days) or y = 1 (needed ventilator support after three days).

因此,第i位患者xi有一个标记结果yi,它可以是 -1 或 1。

Hence, the ith patient, xi, has a labeled outcome, yi, which can be either -1 or 1.

许多国家的医生收集n位患者的数据,创建一组n 个数据点:{( x1 , y1 ), ( x2, y2 ),…, ( xn, yn )}

Doctors in many nations collect data for n patients, creating a set of n data points: {(x1, y1), (x2, y2),…, (xn, yn)}

请注意,x1、x2…xn都是向量。它们都具有相同的维度,在本例中为 6。我们必须训练一个感知器,使得给定一个输入,例如x1(关于第一个患者的信息),感知器应该输出相应的值y1。x2、x3、x4 等也是如此。在我们的数据集中,每个xi都被归类为属于组 -1 或组 1。

Note that x1, x2…xn are all vectors. They all have the same dimension, in this case 6. We have to train a perceptron such that given an input, say, x1 (the information about the first patient), the perceptron should output the corresponding value y1. The same goes for x2, x3, x4, and so on. In our dataset, each xi is classified as belonging to either group -1 or group 1.

我们假设存在于六维中的数据可以线性分为两组。分离超平面将是五维的,无法可视化。您将如何使用这些信息?首先,您将使用已收集的数据训练感知器,以便它找到一些分离超平面。

We assume that the data, which exist in six dimensions, are linearly separable into two groups. The separating hyperplane would be five-dimensional and impossible to visualize. How would you use this information? Well, first you would train the perceptron with the data that have been collected, so that it would find some separating hyperplane.

然后,如果这个训练样本中患者的情况同样适用于所有未来的患者(这是一个主要假设,我们将在后面的章节中更仔细地研究它),想象一个新患者来到医院的场景。你收集必要的数据(x1、x2、x3、x4、x5和x6的值)并将它们输入你的感知器。它应该通过输出 -1 或 1 来告诉你患者三天后是否需要呼吸机支持。例如,这可以用于分诊决策。医生可以有一定把握地让一些人回家,而让其他人留院观察。

Then, if what’s true of the patients in this training sample holds for all future patients—this is a major assumption, and we’ll examine it more carefully in later chapters—imagine a scenario where a new patient comes to the hospital. You collect the necessary data (the values for x1, x2, x3, x4, x5, and x6) and plug them into your perceptron. It should tell you whether the patient will need ventilator support in three days, by outputting -1 or 1. This can, for example, be used for triaging decisions. Doctors can, with some confidence, send some people home but keep others for observation.
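A minimal sketch of that triage step, assuming a hypothetically already-trained weight vector (every number below is invented for illustration, not a real clinical parameter):

```python
def predict(w, features):
    """Classify a new patient with a (hypothetical) trained weight
    vector w = [w0, ..., w6]; w0 is the bias, paired with x0 = 1."""
    x = [1.0] + features
    wx = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if wx > 0 else -1   # 1: keep for observation, -1: send home

# Made-up weights and a made-up patient record
# [age, BMI, breathing difficulty, fever, diabetes, CT score]:
w = [-4.0, 0.02, 0.05, 1.0, 0.5, 0.5, 1.5]
patient = [62.0, 31.0, 1.0, 1.0, 0.0, 2.0]
print(predict(w, patient))   # prints 1: flag this patient for observation
```

The output is just the side of the hyperplane the new 6D point lands on.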

训练感知器意味着找到权重向量w的各个权重 [w0, w1, w2, w3, w4, w5, w6],使得:

Training the perceptron means finding the weights [w0, w1, w2, w3, w4, w5, w6] of the weight vector w, such that:

yi wTxi > 0 for every patient xi in the training data

回想一下,w0 表示偏差项,包含在权重向量中。它总是乘以值 1,即x0。

Recall that w0 represents the bias term and is included in the weight vector. It’s always multiplied by the value 1, or x0.

鉴于此,训练算法需要以下步骤:

Given this, the training algorithm requires the following steps:

  • 步骤 1.将权重向量初始化为零:设置w = 0
  • Step 1. Initialize the weight vector to zero: set w = 0
  • 步骤2.对于训练数据集中的每个数据点x ,执行以下操作:
    • 步骤 2a. 如果y wTx ≤ 0:
    • 权重向量错误,因此更新它:
      wnew = wold + yx
  • Step 2. For each data point x in the training dataset, do the following:
    • Step 2a. If ywTx ≤ 0:
    • the weight vector is wrong, so update it:
      wnew= wold + yx
  • 步骤3.如果步骤2中没有对权重向量进行更新,则终止;否则,转到步骤2并再次迭代所有数据点。
  • Step 3. If there were no updates to the weight vector in Step 2, terminate; otherwise, go to Step 2 and iterate over all the data points once again.
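The three steps above can be sketched directly in code; the toy 2D data points below are invented for illustration, with x0 = 1 folded in to carry the bias:

```python
def train_perceptron(data, max_epochs=1000):
    """Steps 1-3 above: start at w = 0 and fix every misclassified
    point (y * w·x <= 0) by adding y*x, until a full pass is clean."""
    w = [0.0] * len(data[0][0])                    # Step 1: w = 0
    for _ in range(max_epochs):
        updated = False
        for x, y in data:                          # Step 2: visit each point
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # Step 2a
                updated = True
        if not updated:                            # Step 3: clean pass, stop
            return w
    return w

# Toy, linearly separable data: (x, y) pairs with x = [x0=1, x1, x2].
data = [([1.0, 2.0, 1.0], 1), ([1.0, 3.0, 2.0], 1),
        ([1.0, -1.0, -1.5], -1), ([1.0, -2.0, -1.0], -1)]
w = train_perceptron(data)
print(w)   # a weight vector that classifies all four points correctly
```

On this toy set the loop terminates after a clean pass, and the returned w separates the two groups.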

感知器首先将权重向量初始化为零,然后检查所选权重向量是否正确地对每个数据点进行分类。这通过首先计算一个数据点的表达式y w T x的值来完成。如果权重对于数据点x是正确的,并且表达式w T x 的计算结果为负值,则表示x位于超平面的左侧;还表示x被归类为标签y = -1 。因此,如果y的预期值为-1 且表达式w T x计算结果为负数,则它们的乘积将为正数。类似地,如果权重正确且w T x计算结果为正数,则表示x位于超平面的右侧;还表示x被归类为标签y = 1 。因此,如果预期值为 1 且表达式w T x计算结果为正数,则它们的乘积将再次为正数。换句话说,如果权重正确,表达式y w T x将始终为正数。

The perceptron begins by initializing the weight vector to zero and then checks to see if the chosen weight vector correctly classifies each data point one at a time. This is done by first calculating the value of the expression ywTx for one data point. If the weights are correct for the data point x and the expression wTx evaluates to a negative value, it means that x lies to the left of the hyperplane; it also means that x is classified with the label y = -1. So, if the expected value of y is -1 and the expression wTx evaluates to a negative number, their product will be positive. Similarly, if the weights are correct and if wTx evaluates to a positive number, it means x lies on the right side of the hyperplane; and it means that x is classified with the label y = 1. So, if the expected value is 1 and the expression wTx evaluates to a positive number, their product will again be positive. In other words, if the weights are correct, the expression ywTx will always be positive.

但如果权重错误,则y w T x将始终为负数。(表达式w T x计算结果为正数,但y的预期值为-1,因此y w T x将为负数;或者,表达式w T x计算结果为负数,但y的预期值为+1,因此y w T x将为负数。)因此,如果y w T x小于或等于零,则说明存在问题,我们应该更新权重和偏差。

But if weights are wrong, then ywTx will always be a negative number. (Expression wTx evaluates to a positive number, but the expected value of y is -1, so ywTx will be negative; or, the expression wTx evaluates to a negative number, but the expected value of y is +1, so ywTx will be negative.) So, if ywTx is less than or equal to zero, then something is wrong, and we should update the weights and bias.

根据该算法,更新权重涉及将y x添加到w。为什么这样做有效?直观地说,此更新会改变权重向量的方向和大小(从而改变超平面的方向),使得原本位于超平面错误一侧的数据点x最终更接近超平面的正确一侧。对于给定的数据点x,可能需要进行多次这样的更新,以确保x被正确归类为位于超平面的正确一侧。(有关正式证明,请参阅本页的数学尾声。)当然,对一个数据点进行修正意味着超平面可能会对部分或全部其他数据点出错。

As per the algorithm, updating the weights involves adding yx to w. Why does this work? Intuitively, this update is changing the direction and magnitude of the weight vector (and hence the direction of the hyperplane) in such a way that the data point x, which was on the wrong side of the hyperplane, ends up a little closer to being on the correct side of it. For a given data point x, one might have to make multiple such updates to ensure that x is correctly classified as being on the correct side of the hyperplane. (For a formal proof, see the mathematical coda on this page.) Of course, making the correction for one data point means that the hyperplane can go wrong for some of or all the other data points.

因此,感知器会逐个数据点地重复这个过程,直到它确定一组适用于所有数据点的权重和偏差的可接受值。在此过程中,感知器会找到两组数据点之间的线性差距。

So, the perceptron iterates over this process, data point by data point, until it settles on an acceptable set of values for the weights and bias that works for all data points. In doing so, the perceptron finds the linear divide between the two sets of data points.

就计算机算法而言,这非常简单。数学家面临的问题是:我们如何确定它会终止?为什么它不会无限期地继续下去,总是至少有一个数据点出错?

As computer algorithms go, this is amazingly simple. The question for mathematicians was this: How can we be sure that it will terminate? Why won’t it keep going indefinitely, by always getting at least one data point wrong?

这就是收敛证明的用武之地——特别是明斯基和帕普特在他们的书《感知器》中提出的一个特别优雅的证明。我们首先重申主要假设:存在一个线性分离超平面,其特征是权重向量w *。感知器必须找到w *。当然,有许多潜在的此类超平面,算法只需找到一个即可。

That’s where the convergence proofs come in—in particular, an especially elegant one by Minsky and Papert in their book, Perceptrons. We start by restating the main assumption: There exists a linearly separating hyperplane characterized by the weight vector w*. The perceptron has to find w*. There are, of course, many potential such hyperplanes, and the algorithm needs to find only one.

该算法首先使用初始化为零的权重向量w。现在考虑w和w*的点积。随着我们更新权重向量w,它开始越来越多地指向所需权重向量w*的方向,无论选择哪种w*,w和w*之间的角度都会趋近于零。w和w*的点积由 ||w|| ||w*|| cos(θ) 给出,它不断增加,因为 cos(θ) 从零(当两个向量垂直且彼此最不相似时)变为 1(当它们平行并因此指向同一方向时)。因此,随着算法的学习,我们希望wTw*不断增加;这表明它正在发挥作用。然而,wTw*也可能仅仅因为w的幅度不断增加而方向没有变化而增加。在这种情况下,wTw(w与自身的点积)也会增加。因此,证明的本质是表明在训练过程中,wTw的增长速度低于wTw*。如果是这样,当w与w*一致时,算法将在有限的步骤内收敛。渴望理解证明的读者可以在本页的数学尾声中找到它。

The algorithm starts by using a weight vector w initialized to zero. Now consider the dot product of w and w*. As we update the weight vector w and it starts pointing more and more in the direction of the desired weight vector w*, the angle between w and w* approaches zero, regardless of the choice of w*. The dot product of w and w*, given by ||w|| ||w*|| cos(θ), keeps increasing, because cos(θ) goes from zero (when the two vectors are perpendicular and most unlike each other) to 1 (when they are parallel and, hence, pointing in the same direction). So, as the algorithm learns, we want w.w* to keep increasing; that’s an indication that it is working. However, w.w* can also increase simply because the magnitude of w keeps increasing while showing no change in direction. In this case, w.w (the dot product of w with itself) will also increase. So, the essence of the proof involves showing that during training, w.w increases less rapidly than w.w*. If that’s the case, the algorithm will converge in a finite number of steps, when w coincides with w*. Readers keen to understand the proof can find it in the mathematical coda on this page.
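This intuition can be checked numerically. The sketch below (the separator w* and the data points are invented for illustration) records wTw* and wTw after every update and confirms that wTw* climbs with each mistake:

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# A known separating unit vector w* and toy points labeled by it:
w_star = [1 / math.sqrt(2), 1 / math.sqrt(2)]
data = [([1.0, 0.0], 1), ([0.0, 1.0], 1),
        ([-1.0, 0.1], -1), ([0.1, -1.0], -1)]

w = [0.0, 0.0]
progress = []                     # (w·w*, w·w) after each update
converged = False
while not converged:
    converged = True
    for x, y in data:
        if y * dot(w, x) <= 0:    # mistake: update and record
            w = [wi + y * xi for wi, xi in zip(w, x)]
            progress.append((dot(w, w_star), dot(w, w)))
            converged = False

# w·w* grows with every single update, as the proof demands:
assert all(b[0] > a[0] for a, b in zip(progress, progress[1:]))
```

On this toy set the run converges after two updates, with w·w* rising at each one.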

证明建立了一个不等式。它表示,如果算法在找到解决方案之前更新权重向量 M 次(或犯了 M 个错误),则 M 必须小于或等于一个有限数。它通过建立算法的所谓下限和上限来实现这一点,即衡量算法达到所需解决方案至少和至多需要多少时间和资源。在称为计算复杂性理论的研究领域中,证明算法的这类界限是一项困难、复杂且深奥的任务。

The proof establishes an inequality. It says that if the algorithm updates the weight vector M times (or makes M mistakes) before finding the solution, then M must be less than or equal to a finite number. It does this by establishing what are called lower and upper bounds for the algorithm, which are measures of at least and at most how much time and resources the algorithm needs to arrive at the desired solution. Proving such bounds for algorithms is a difficult, intricate, and esoteric task in a field of research called computational complexity theory.

2018 年,我在加州大学伯克利分校认识的一位年轻研究员曼努埃尔·萨宾 (Manuel Sabin) 在一部由我编剧、主持并联合执导的短片中对此类工作给出了雄辩的阐述(该片是纪录片系列的一部分)。“下限和上限之间存在着深刻的联系。通常,你可以说它们是同一枚硬币的两面,”他说。下限告诉我们某件事是否不可能。假设你证明,随着数据点数量的增加,算法的运行时间会呈指数级增长。那么你会遇到一些问题,“直到太阳吞没地球你才会知道答案,”萨宾说。“所以,下限……谈论的是在我们的一生中可以知道的事情。”

In 2018, Manuel Sabin, a young researcher I met at the University of California, Berkeley, gave an eloquent perspective on such work in a short film that I wrote, hosted, and co-directed (the film was part of a documentary series). “There are deep connections between lower bounds and upper bounds. Oftentimes, you can say they are two sides of the same coin,” he said. Lower bounds tell us about whether something is impossible. Say you prove that an algorithm will take exponentially longer to run as you increase the number of data points. You will then encounter problems for which you “won’t know the answer until the sun engulfs the earth,” Sabin said. “So, lower bounds…talks about what is knowable within our lifetimes.”

难怪在 20 世纪 60 年代,为感知器学习算法建立这样的界限是一件大事。如果存在线性分离超平面,该算法总能在有限的时间内找到它。明斯基和帕普特、布洛克等人对大量此类证明负责。感知器风靡一时。

It was no wonder that establishing such bounds for the perceptron learning algorithm was a huge deal in the 1960s. The algorithm will always find a linearly separating hyperplane in finite time if one exists. Minsky and Papert, Block, and others were responsible for a slew of such proofs. Perceptrons were all the rage.

第一次大寒潮

THE FIRST BIG CHILL

然而,明斯基和帕普特 1969 年出版的那本书,在为感知器研究提供坚实数学基础的同时,也给它泼了一盆冷水。在他们书中的众多证明中,有一个涉及单层感知器永远无法解决的一个非常简单的问题:XOR 问题。请看下图所示的四个数据点。

But then, Minsky and Papert’s 1969 book, which provided such a firm mathematical foundation for research on perceptrons, also poured an enormous amount of cold water on it. Among the many proofs in their book, one addresses a very simple problem that a single layer of perceptrons could never solve: the XOR problem. Look at the four data points shown in the figure below.

任何直线都无法将圆圈与三角形分开。在这种情况下,点 (x1, x2) 为:(0, 0)、(1, 0)、(1, 1) 和 (0, 1)。为了使感知器将点 (0, 0) 和 (1, 1) 表示的圆与点 (1, 0) 和 (0, 1) 表示的三角形分开,它必须能够在x1和x2都为 0 或都为 1 时输出y = 1,否则输出y = -1。不存在这样的直线,这一点很容易从视觉上看出。明斯基和帕普特证明,单层感知器无法解决此类问题。上面说明的情况是最简单的情形,让人想起布尔逻辑中的双输入异或门:当两个输入不同时输出 1,相同时输出 0。(这里的标签恰好相反,对应异或的补,即同或门,但线性不可分的论证完全相同。)

No straight line you can draw will separate the circles from the triangles. The points (x1, x2) in this case are: (0, 0), (1, 0), (1, 1) and (0, 1). For the perceptron to separate the circles, represented by the points (0, 0) and (1, 1), from the triangles, represented by (1, 0) and (0, 1), it must be able to generate an output y = 1 when both x1 and x2 are 0 or both x1 and x2 are 1, and an output y = -1 otherwise. No such straight line exists, something that is easy to see visually. Minsky and Papert proved that a single layer of perceptrons cannot solve such problems. The situation illustrated above is the simplest case and calls to mind the two-input XOR gate in Boolean logic, which outputs a 1 if the inputs differ and a 0 if they are the same. (Here the labels are flipped, matching XNOR, the complement of XOR, but the inseparability argument is identical.)
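A quick numerical check makes the same point. The toy sketch below runs the perceptron update rule on the four points from the figure (with x0 = 1 carrying the bias) and shows it never settling:

```python
def fails_to_converge(data, epochs=500):
    """Run the perceptron update rule; report whether it is still
    making mistakes after `epochs` full passes over the data."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return False          # found a separating line
    return True                   # still erring: no such line exists

# The four points from the figure, with x0 = 1 for the bias:
xor_like = [([1.0, 0.0, 0.0], 1), ([1.0, 1.0, 1.0], 1),
            ([1.0, 1.0, 0.0], -1), ([1.0, 0.0, 1.0], -1)]
print(fails_to_converge(xor_like))   # prints True
```

Because no separating line exists, every pass over these four points contains at least one mistake, no matter how long the loop runs.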

如果堆叠感知器,使一个感知器的输出馈送到另一个感知器的输入,那么就有可能解决 XOR 问题。这就是所谓的多层感知器。罗森布拉特并非不知道这个问题。纳吉告诉我:“他肯定和明斯基一样清楚,甚至比明斯基更清楚单层的局限性。”然而,多层感知器的问题在于,没有人知道如何训练这样的网络,包括明斯基和帕普特。如果必须更新多个感知器的权重,我们之前遇到的算法就不起作用。

It’s possible to solve the XOR problem if you stack perceptrons, such that the output of one feeds into the input of another. These would be so-called multi-layer perceptrons. Rosenblatt wasn’t oblivious to this problem. “He knew certainly as well or better than Minsky the limitations of a single layer,” Nagy told me. However, the problem with multiple layers of perceptrons was that no one knew how to train such networks, including Minsky and Papert. The algorithm we encountered earlier doesn’t work if the weights of more than one layer of perceptrons have to be updated.

围绕神经网络的喧嚣逐渐平息。所有关于“能行走、说话、看、写、自我复制并意识到自己存在”的电子计算机的讨论都烟消云散了,将感知器设备作为“机械太空探测器”送往其他星球的任何设想也随之消散。资助机构犹豫不决,资金消失,曾经很有前途的研究领域几乎陷入停滞。该领域的人士将 1974 年至 1980 年称为第一个人工智能寒冬。剑桥大学卢卡斯应用数学教授詹姆斯·莱特希尔爵士对该领域进行了调查,并于 1972 年发表了一份关于人工智能现状的报告。他的报告甚至有一个名为“过去的失望”的部分。它以这样的话开头:“大多数从事人工智能研究和相关领域的工作者都承认,他们对过去 25 年取得的成就感到失望。人们在 1950 年左右甚至 1960 年左右进入该领域,他们寄予厚望,但 1972 年的希望还远远没有实现。迄今为止,该领域的任何发现都没有产生当时承诺的重大影响。”

The hoopla surrounding neural networks subsided. All talk of an electronic computer that would “walk, talk, see, write, reproduce itself and be conscious of its existence” vaporized, as did any notion of sending perceptron devices to other planets as “mechanical space explorers.” Funding agencies balked, money disappeared, and a once-promising field of research ground to an almost dead halt. Those in the field refer to the years from 1974 to 1980 as the first AI winter. Sir James Lighthill, the Lucasian professor of applied mathematics at Cambridge University, surveyed the field and, in 1972, published a report about the state of AI. His report even had a section called “Past Disappointments.” It begins with these words: “Most workers in AI research and in related fields confess to a pronounced feeling of disappointment in what has been achieved in the past twenty-five years. Workers entered the field around 1950, and even around 1960, with high hopes that are very far from having been realized in 1972. In no part of the field have the discoveries made so far produced the major impact that was then promised.”

就神经网络而言,需要物理学家对生物问题的独特解决方案才能重新激发该领域的活力。那是在 1982 年。然后,在 1986 年,David E. Rumelhart、Geoffrey E. Hinton 和 Ronald J. Williams 发表了一篇关于反向传播算法的开创性论文。(这个想法本身早于他们的工作,但他们的论文让它名声大噪。)该算法展示了如何训练多层感知器,依赖于微积分和优化理论。计算机还需要十五年才能变得足够强大,能够满足人工神经网络的计算需求,但这篇“反向传播”论文引发了一场缓慢燃烧的革命。

As far as neural networks are concerned, it’d take a physicist’s unique solution to a biological problem to re-energize the field. That was in 1982. Then, in 1986, David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams published a pathbreaking paper on an algorithm called backpropagation. (The idea itself predated their work, but their paper put it firmly on the map.) The algorithm, which showed how to train multi-layer perceptrons, relies on calculus and optimization theory. It’d take fifteen more years before computers became powerful enough to handle the computational demands of artificial neural networks, but the “backprop” paper set a slow-burning revolution in motion.

然而,以微积分为重点的反向传播算法的前身,是在罗森布拉特展示感知器的同一时期形成的。20 世纪 50 年代末的一个周末,一位年轻的助理教授和一位才华横溢的研究生发明并实现了一种算法,该算法后来被证明与感知器同等重要,并蕴含着有朝一日训练多层神经网络的线索。

The precursor to the backpropagation algorithm, with its emphasis on calculus, however, was taking shape at about the same time as Rosenblatt was showing off his perceptron. Toward the very end of the 1950s, over the course of a weekend, a young assistant professor and an immensely talented graduate student invented and implemented an algorithm that would prove to be as important as the perceptron and would contain clues for one day training multi-layer neural networks.

数学尾声

MATHEMATICAL CODA

您可以跳过以下证明;这样做不会影响您对后续章节的理解。不过,我应该说,正是在听康奈尔大学计算机科学教授 Kilian Weinberger 的讲座录音时(他在 2018 年机器学习课程中向学生讲解了这个证明),我才意识到自己想写这本书。这是一个漂亮的证明。

Feel free to skip the following proof; doing so won’t impact your understanding of what comes in subsequent chapters. I should say, however, that it was while listening to recordings of lectures by Kilian Weinberger, professor of computer science at Cornell University, in which he explains this proof to the students of his 2018 course on machine learning, that I realized I wanted to write this book. It’s a beautiful proof.

算法:感知器更新规则

THE ALGORITHM: THE PERCEPTRON UPDATE RULE

(该规则和证明改编自温伯格的演讲。)

(This rule and proof adapted from Weinberger’s lecture.)

  • 步骤1.将权重向量初始化为零:设置w = 0。
  • Step 1. Initialize the weight vector to zero: set w = 0.
  • 步骤2.对于训练数据集中的每个数据点x ,执行以下操作:
    • 步骤 2a. 如果y wTx ≤ 0:
    • 权重向量错误,因此更新它:
      wnew = wold + yx
  • Step 2. For each data point x in the training dataset, do the following:
    • Step 2a. If ywTx ≤ 0:
    • the weight vector is wrong, so update it:
      wnew= wold + yx
  • 步骤3.如果步骤2中权重向量没有更新,则终止,否则转到步骤2并再次迭代所有数据点。
  • Step 3. If there were no updates to the weight vector in step 2, terminate, otherwise go to step 2 and iterate over all the data points once again.

如果y wTx ≤ 0,我们就对权重向量进行更新(原因请参阅本页“保证成功”一节中的解释):

We make an update to the weight vector if ywTx ≤ 0 (see the explanation in the “Guaranteed to Succeed” section on this page for why this is the case):

wnew = wold + yx

为了使新的权重向量能够正确地对x进行分类,我们需要证明,最终,y wTx > 0(因为如果它 ≤ 0,就需要更新)。在每一步更新中:

For the new weight vector to classify x correctly, we need to prove that, eventually, ywTx > 0 (because if it were ≤ 0, it would have required updating). At each step of the update:

y wnewTx = y(wold + yx)Tx = y woldTx + y2xTx

第二个右侧项y 2 x T x ≥ 0,因为y 2 = 1,且x T x ≥ 0。为什么x T x ≥ 0?因为它是向量与自身的点积。它始终为正数或零。它类似于对标量求平方 — 您始终会得到正数或零。

The second right-hand side term y2 xT x is ≥ 0, because y2 = 1, and xT x ≥ 0. Why is xT x ≥ 0? Well, it’s the dot product of a vector with itself. That’s always a positive number or zero. It’s akin to squaring a scalar—you’ll always get a positive number or zero.

因此,经过一次更新后,y wnewTx 的负值比 y woldTx 略小,这意味着权重向量正朝着对这一个数据点x正确的方向移动。最终,经过若干次更新后,算法将正确分类x。必须对每个数据点重复此过程,直到权重向量正确分类所有数据。

So, y wnewTx, after one update, is a little less negative than y woldTx, which means that the weight vector is moving in the right direction for the one data point x. Eventually, after some unspecified number of updates, the algorithm will classify x correctly. This process must be repeated for every data point until the weight vector classifies all data correctly.
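A small numeric sketch (all values invented) shows the quantity ywTx rising by exactly y²·xTx after one update on a misclassified point:

```python
def score(w, x, y):
    """The quantity y * w·x: negative means x is misclassified."""
    return y * sum(wi * xi for wi, xi in zip(w, x))

# A hypothetical misclassified point:
w_old = [0.5, -1.0]
x = [0.6, 0.8]                    # normalized, so x·x = 1
y = 1
before = score(w_old, x, y)       # negative: x is on the wrong side
w_new = [wi + y * xi for wi, xi in zip(w_old, x)]
after = score(w_new, x, y)        # rises by exactly y²·(x·x) = 1
print(before, after)
```

Here a single update already flips the score from negative to positive; in general several updates may be needed.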

接下来的证明表明,找到新的、正确的权重向量所需的更新次数总是有限的。

The proof that follows shows that the number of updates required to find the new, correct weight vector is always finite.

感知器收敛证明

THE PERCEPTRON CONVERGENCE PROOF

假设:

Assumptions:

w:初始化为零的 d 维权重向量;
w: the d-dimensional weight vector initialized to zero;
w *:感知器需要学习的 d 维权重向量;它垂直于线性分离超平面。令w * 为量级为 1 的单位向量;
w*: the d-dimensional weight vector the perceptron has to learn; it’s perpendicular to the linearly separating hyperplane. Let w* be a unit vector of magnitude 1;
x:表示输入数据点或实例的向量;x是 d 维向量,因此具有元素 [ x1, x2,…, xd ]。如果有n 个数据点,则每个这样的实例都是更大的n × d矩阵X(n 行,d 列)中的一行;
x: the vector representing an input data point, or instance; x is a d-dimensional vector, so with elements [x1, x2,…, xd]. If there are n data points, then each such instance is a row in a larger n × d matrix X (n rows, d columns);
y:感知器的输出,对于输入向量x;输出可以是 -1 或 1。所有输出可以收集到一个 n 维向量y中:[ y1, y2,…, yn ];并且
y: the output of the perceptron, for an input vector x; the output can be -1 or 1. All the outputs can be collected into one n-dimensional vector y: [y1, y2,…, yn]; and
γ(gamma):线性分离超平面与最近数据点之间的距离。
γ (gamma): the distance between the linear separating hyperplane and the closest data point.

以下是感知器的方程(忽略显式偏差项;我们之前已经看到如何将其纳入这一公式):

The following is the equation for the perceptron (ignoring an explicit bias term; we saw earlier how to incorporate it into this formulation):

y = 1 if wTx > 0; otherwise y = -1

目标是证明如果你不断更新w,它将收敛到w * (即两个向量将指向同一方向)。并且因为w * 按照定义垂直于分离超平面,所以w也垂直于分离超平面。

The goal is to prove that if you keep updating w, it will converge to w* (meaning, the two vectors will point in the same direction). And because w*, by definition, is perpendicular to the separating hyperplane, so, too, will be w.

首先,对所有输入数据点进行归一化,使得距离原点最远的数据点的幅度为 1,而所有其他数据点的幅度都小于或等于 1。这可以通过将每个向量x除以距离原点最远的数据点(向量)的幅度来实现。这样,最远的向量的幅度变为 1,而所有其他向量的幅度都小于或等于 1。这不会改变数据点/向量之间的关系,因为我们只是将它们的大小按相同比例缩小;它们的方向保持不变。

First, normalize all the input data points such that the data point farthest from the origin has a magnitude of 1 and all other data points have magnitudes less than or equal to 1. This can be done by dividing each vector x by the magnitude of the data point, or vector, that’s farthest from the origin. So, the farthest vector will now have a magnitude of 1, and all other vectors will have magnitudes that are less than or equal to 1. This doesn’t change the relationship between the data points/vectors because we are simply scaling down their magnitudes by the same amount; their directions remain the same.

归一化之后,0 < γ ≤ 1。

Once normalized, 0 < γ ≤ 1.
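A short sketch of this normalization step, with toy 2D points and a hypothetical unit-length separator w* (all values invented):

```python
import math

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def normalize(points):
    """Divide every point by the largest norm, so all ||x|| <= 1."""
    r = max(norm(x) for x in points)
    return [[xi / r for xi in x] for x in points]

# Toy 2D points; after scaling, the farthest one has norm 1.
points = [[3.0, 4.0], [-1.0, -2.0], [0.5, 1.0]]
scaled = normalize(points)

# gamma: distance from the separator w* to the closest scaled point.
w_star = [0.6, 0.8]               # a unit vector
gamma = min(abs(sum(wi * xi for wi, xi in zip(w_star, x)))
            for x in scaled)
print(gamma)                      # a value in (0, 1]
```

Scaling divides every coordinate by the same factor, so directions, and hence separability, are untouched.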

回想一下,当输入x被错误分类时,我们会更新权重向量:

Recall that we update the weight vector when an input, x, is classified incorrectly:

如果y w T x ≤ 0:

If ywTx ≤ 0:

ww+ y x
wnewwold + yx

w越来越接近我们期望的方向w * 时,两个向量的点积,或者说w T w * 会变得越来越大。

As w comes closer to w*, which is our desired direction, the dot product of the two vectors, or wTw*, gets bigger.

但是,如果w的幅度增长且方向相对于w *不变,则w T w * 也会增长。如果w 的幅度增长,则w T w ( w与自身的点积)也会增长。因此,只有当w T w * 的增长速度快于w T w时,算法才会收敛,因为这种情况只会在w与w *保持一致而不仅仅是幅度增长时才会发生。

But wTw* can also increase if w grows in magnitude without changing direction relative to w*. If w is growing in magnitude, then wTw, which is the dot product of w with itself, will also increase. So, the algorithm will converge only if wTw* increases faster than wTw, as that would happen only because w is getting aligned with w* and not just increasing in magnitude.

让我们在每次更新时计算w T w *:

Let’s calculate wTw* upon each update:

wnewTw* = (wold + yx)Tw* = woldTw* + yxTw*

右边的第二项是y x T w *。如果有两个 d 维向量ab,我们知道a T b = b T a。因此,y x T w * = y w * T x。我们知道y w * T x > 0,因为w * 是正确的假定权重向量,它应该可以正确分类x

The second term of the right-hand side is yxTw*. If you have two d-dimensional vectors a and b, we know that aTb = bTa. So, yxTw* = yw*Tx. We know that yw*Tx > 0, because w* is the correct presumed weight vector, and it should classify x correctly.

单位向量w * 和x的点积是x与以w *为特征的超平面的距离。我们将 γ 定义为最近数据点与超平面之间的距离。因此,y w * T x不仅大于 0,而且始终大于或等于 γ。

The dot product of the unit vector w* and x is the distance of x from the hyperplane characterized by w*. We defined γ as the distance between the closest data point and the hyperplane. So, yw*Tx is not only greater than 0, but also always greater than or equal to γ.

因此,

Thus,

wnewTw* ≥ woldTw* + γ

中期结果 1:这告诉我们一些重要的事情。每次更新时,ww *之间的点积至少会增加γ。

Interim result 1: This is telling us something important. The dot product between w and w* grows by at least γ with each update.

现在我们来检查一下w T w的增长率。

Now let’s examine the rate of growth of wTw.

wnewTwnew = (wold + yx)T(wold + yx)

= (wold + yx)Twold + (wold + yx)Tyx

由于对任意两个向量a和b都有aTb = bTa,上式可化简为:

Since aTb = bTa for any two vectors a and b, this simplifies to:

wnewTwnew = woldTwold + 2ywoldTx + y2xTx

因此,新权重向量与自身的点积等于旧权重向量与自身的点积加上两个新项。我们必须算出新项的贡献。

So, the dot product of the new weight vector with itself equals the dot product of the old weight vector with itself plus two new terms. We must figure out the contribution of the new terms.

我们知道第一个新项 2y woldTx ≤ 0,因为y woldTx ≤ 0;这正是我们要更新权重向量的原因。

We know that the first new term, 2ywoldTx, is ≤ 0, because ywoldTx ≤ 0. That’s the reason we are doing an update to the weight vector.

第二个新项是y 2 x T x。因为y要么是 +1 要么是 -1,所以y 2 = 1。此外,x T x始终小于或等于 1(这是因为我们之前已对所有表示数据点的向量进行了归一化,因此它们的幅度始终小于或等于 1)。

The second new term is y2xTx. Because y is either +1 or -1, y2 = 1. Also, xTx is always less than or equal to 1 (this is because we normalized all the vectors representing data points earlier, so their magnitudes are always less than or equal to 1).

因此,等式变成:

So, the equation becomes:

wnewTwnew ≤ woldTwold + 1

中期结果 2:这告诉我们,每次更新时,权重向量与其自身的点积最多增加 1。

Interim result 2: This is telling us that the dot product of the weight vector with itself grows by at most 1 with each update.

Now, on the one hand, we have wᵀw* growing by at least γ with each update and, on the other hand, we have wᵀw growing by at most 1 with each update.

Let’s say the algorithm makes M updates to find the linearly separating hyperplane. Our task is to prove that M is a finite number and that the algorithm converges to a solution.

We start with the weight vector initialized to zero, so the initial value of wᵀw* is zero. After the first update, the dot product would have grown at least by γ.

After 1 update: wᵀw* ≥ γ

After 2 updates: wᵀw* ≥ γ + γ = 2γ

After 3 updates: wᵀw* ≥ 2γ + γ = 3γ

After M updates: wᵀw* ≥ (M - 1)γ + γ = Mγ

So: Mγ ≤ wᵀw* … (1)

Similarly, by using interim result 2, which says that wᵀw increases by at most 1 after each update, after M updates, we should have:

wᵀw ≤ M … (2)

Now, because of (1) we have:

Mγ ≤ wᵀw*

= ||w|| ||w*|| cos(θ); this is the definition of the dot product.

⇒ Mγ ≤ ||w||, because 0 ≤ cos(θ) ≤ 1 and ||w*|| = 1, by design. Therefore:

Mγ ≤ √(wᵀw), because ||w|| = √(wᵀw), by definition.

The right-hand side can be substituted using the result in (2), giving us:

Mγ ≤ √M ⇒ M²γ² ≤ M ⇒ M ≤ 1/γ²

After all that analysis, we have arrived at a staggering result: The number of updates that the perceptron makes to find a linearly separating hyperplane is less than or equal to 1/γ². Because γ is always a positive quantity that’s less than or equal to 1, M is always a finite quantity. The perceptron will converge without fail in a finite number of steps.
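
This bound can be checked empirically. Below is a minimal sketch (my own illustration, not code from the book) that runs the perceptron rule w_new = w_old + yx on a toy, linearly separable dataset; the cluster centers, noise level, and helper names are assumptions made for this example.

```python
import numpy as np

def perceptron_updates(X, y, max_epochs=10_000):
    """Run the perceptron rule (w <- w + y*x on each mistake) and count updates."""
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                updates += 1
                mistakes += 1
        if mistakes == 0:            # a full pass with no errors: converged
            break
    return w, updates

# Two well-separated clusters, normalized so every ||x|| <= 1 as in the proof.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.5, 0.05, (20, 2)), rng.normal(-0.5, 0.05, (20, 2))])
X /= np.linalg.norm(X, axis=1).max()
y = np.array([1.0] * 20 + [-1.0] * 20)

w, M = perceptron_updates(X, y)
# The margin of the separator we found is a valid, if conservative, gamma.
gamma = (y * (X @ w)).min() / np.linalg.norm(w)
print(f"updates M = {M}, bound 1/gamma^2 = {1 / gamma**2:.1f}")
```

On data this well separated, the perceptron typically converges after a handful of updates, comfortably inside the 1/γ² bound.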

QED

CHAPTER 3

The Bottom of the Bowl

It was the autumn of 1959. Bernard Widrow, a young academic on the cusp of turning thirty, was in his office at Stanford University when a graduate student named Marcian “Ted” Hoff came looking for him. The young man arrived highly recommended. The day before, a senior professor at Stanford had reached out to Widrow on Hoff’s behalf, saying, “I’ve got this student named Ted Hoff. I can’t seem to get him interested [in my research]; maybe he’d be interested in what you’re doing. Would you be willing to talk with him?” Widrow replied, “Sure, happy to.”

“So, the next day, knocking on my door was Ted Hoff,” Widrow told me.

Widrow welcomed him in and proceeded to discuss his work, which was focused on adaptive filters—electronic devices that learn to separate signals from noise—and the use of calculus to optimize such filters. As Widrow chalked up the math on the blackboard, Hoff joined in, and soon the conversation morphed into something more dramatic. During that discussion, the two invented what came to be called the least mean squares (LMS) algorithm, which has turned out to be one of the most influential algorithms in machine learning, having proven foundational for those figuring out how to train artificial neural networks. “When I wrote the LMS algorithm on the blackboard for the first time, somehow I just knew intuitively that this is a profound thing,” Widrow told me. “Too bad I didn’t have a camera to take a picture.”

Widrow grew up in a small town in Connecticut. He could hardly have imagined his luminous academic career. His father ran an ice-manufacturing plant. A curious young Widrow hung around the plant, amid generators, motors, and compressors, always asking questions. He admired the plant’s electrician, who taught him the basics of the trade. While Widrow was still in high school, his dad sat him down and asked, “What do you think you want to be when you grow up?”

The teen answered, “I want to be an electrician.”

His father said, “You don’t want to be an electrician. You want to be an electrical engineer.”

That subtle course correction led Widrow to MIT in 1947, where he obtained his bachelor’s, master’s, and doctorate; he joined MIT as an assistant professor in 1956. One day during the summer of that year, Widrow’s colleague Ken Shoulders came into the lab and told him about a workshop on artificial intelligence at Dartmouth College he was attending; did Widrow want to come? “I said, ‘What’s artificial intelligence?’ He said, ‘I don’t know. But it sounds interesting.’ So, I said, ‘Sure. I’ll go with you.’ ”

The coining of the term “artificial intelligence” is credited to John McCarthy, a mathematics professor at Dartmouth College. In August 1955, McCarthy; Marvin Minsky, who was then at Harvard University; Nathaniel Rochester of IBM; and Claude Shannon of Bell Telephone Laboratories, put out “A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence.” It began with a bold declaration:

We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.

Widrow remembered it as an open seminar, no invitation needed. You could go and stay as long as you wanted. You spoke up if you had something to say, or you could simply listen. Widrow listened and then returned to MIT all charged up. He wanted to build a thinking machine. “I spent six months thinking about thinking,” he said. “The conclusion was that with the circuitry, the technology that we had at that time, I expected it’d be twenty-five years before we’d be able to build a thinking machine.” For a young researcher at the start of his career, it seemed a foolhardy venture. Widrow abandoned his plans and turned to something more concrete: adaptive filters that could learn to remove noise from signals. He was particularly interested in the digital form of adaptive analog filters developed by Norbert Wiener. (We encountered Wiener in the previous chapter as the man who coined the term “cybernetics.”)

To understand Wiener’s analog filter, consider some continuously varying (hence analog) signal source. Some noise is added to the signal, and the filter’s job is to tell signal from noise. Wiener’s filter theory showed how this could be done. Others adapted the theory to digital signals. Instead of being continuous, digital signals are discrete, meaning they have values only at certain points in time (say, once every millisecond). Widrow wanted to build a digital filter, but one that could learn and improve over time. In other words, it would learn from its mistakes and become a better version of itself.

At the heart of such an adaptive filter is a nifty bit of calculus. Imagine that the filter, at any given time, makes an error. Let’s assume that we are able to keep track of such errors for ten time steps. We must reduce the error the filter makes by looking at its previous ten errors and adjusting its parameters. One measure of the mistakes is simply the average of the previous ten errors. However, errors can be positive or negative, and if you just add them to take the average, they can cancel each other out, giving the wrong impression that the filter’s working well. To avoid this, take the square of each error (thus turning it into a positive quantity) and then take the average of the squares of the errors. There are other advantages to squaring the errors and averaging them that have to do with statistics and calculus, but we don’t need to focus on those yet. The goal is to minimize this “mean squared error” (MSE) with respect to the parameters of the filter. To restate, we must change the values of the filter’s parameters at each time step such that the average, or mean, of the squared errors of the past, say, ten steps is minimized. Understanding how this works requires delving into some simple calculus and learning a method that was first proposed in 1847 by Baron Augustin-Louis Cauchy, a French mathematician, engineer, and physicist. It’s called the method of steepest descent.
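
To make the squaring step concrete, here is a tiny sketch with made-up error values, showing why the plain mean of the errors can be misleading while the mean squared error is not:

```python
# Ten hypothetical filter errors over the last ten time steps.
errors = [0.5, -0.4, 0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.4, -0.5]

plain_mean = sum(errors) / len(errors)          # positives and negatives cancel
mse = sum(e**2 for e in errors) / len(errors)   # squares are always positive

print(plain_mean)   # ~0.0: wrongly suggests the filter is doing fine
print(mse)          # ~0.11: the true size of the errors shows up
```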

DOWN FROM ON HIGH

If you have seen pictures of—or, better yet, visited—rice paddies on hillsides, particularly in China, Japan, and Vietnam, you may have marveled at the flat terraces cut into the sides of the hills. If we walk along any one terrace, we remain at the same elevation. The edges of the terraces trace the contours of the terrain. Imagine standing on some terrace way up a hillside. Down in the valley below is a village. We have to get to the village, but it’s getting dark, and we can see only a few feet ahead of us. Let’s say the hillside is not too steep and that it’s possible to clamber down even the steepest parts. How will we proceed?

We can stand at the edge of the terrace and look for the steepest route to the terrace below. That’s also the shortest path down to the next piece of level ground. If we repeat the process from terrace to terrace, we will eventually reach the village. In doing so, we will have taken the path of steepest descent. (This might not be a straight line from our initial position down to the village; we might have had to zigzag down the hillside.)

What we instinctively did was evaluate the slope, or the gradient, of the hillside as we looked in different directions while standing at the edge of a terrace and then took the steepest path down each time. We just did some calculus in our heads, so to speak.

More formally, let’s look at going down a slope for a 2D curve, given by the equation y = x².

First, we plot the curve on the xy plane and then locate ourselves on the curve at some value of x, say, x = 0.5. At that point, the curve has a slope.

One way to find the slope of a curve is to draw a tangent to the curve at the point of interest. The tangent is a straight line. Imagine walking a smidgen along the straight line. You would be at a new location, where the x-coordinate has changed by an infinitesimal amount (Δx; read that as “delta x”) and the y-coordinate has also changed by a corresponding infinitesimal amount (Δy). The slope is Δy/Δx. (If you think of climbing stairs, then the slope of the stairs is given by the rise divided by the run, where the rise is Δy, or how much you go up vertically with each step, and the run is Δx, the amount you move horizontally with each step.)

Of course, when you do this in our example, you have moved along the tangent to the curve, not along the curve itself. So, the slope really pertains to the tangent and not the curve. However, if the change in the x-direction, Δx, approaches zero, then the slope of the tangent line gets closer and closer to the slope of the curve at the point of interest, until the two become the same when Δx = 0. But how do you calculate the slope when the denominator in Δy/Δx is zero? That’s where calculus steps in.

Differential calculus is a branch of calculus that lets us calculate the slope of a continuous function (one that has no cusps, breaks, or discontinuities). It lets you analytically derive the slope in the limit ∆x → 0 (read that as “delta-x tends to zero”), meaning the step you take in the x-direction becomes vanishingly small, approaching zero. This slope is called the derivative of a function.

For our function y = x², the derivative equals 2x. (Finding the derivative of a function is at the heart of differential calculus, but we won’t get into the details here. For the functions used in this book, I’ll simply provide the derivatives. To understand how to find the derivative of a function and for a list of derivatives of common functions, consult Wolfram MathWorld.)

We write our derivative as:

dy/dx

This is called the derivative of y with respect to x.

Calculus can cause our eyes to glaze over, if not induce downright dread. But as Silvanus P. Thompson, a professor of physics, electrical engineer, and member of the Royal Society, wrote in his classic Calculus Made Easy (first published in 1910), the “preliminary terror” of symbols in calculus “can be abolished once [and] for all by simply stating what is the meaning—in common-sense terms—of the…principal symbols.” The symbol d, he points out, simply means a “little bit of.” So, dy/dx is a little bit of y divided by a little bit of x. The beauty of calculus is that you can calculate this ratio even as that “little bit of x” tends to zero, or dx → 0.

Given the derivative, you now have a way of determining the slope at any point along the curve. So, for the function y = x², the slope is dy/dx = 2x, and at x = 2 the slope equals 4. At x = 1, the slope is 2; at x = 0.5, the slope is 1; and at x = 0, the slope is 0. You can see that when we move along the curve, as the value of x decreases from 2, so does the slope, until the function reaches a minimum (where the slope becomes zero); decrease x further and the slope turns negative. The slope in general is zero at the minimum of a function; in this example, the (x, y) coordinates also happen to be (0, 0) at the minimum, but that doesn’t have to be so.

We are now equipped to understand the method of steepest descent, also known as the method of gradient descent. Let’s say we are at the coordinate (1, 1). We want to reach the bottom of the curve, where the slope is zero and the value of the function is at its minimum. At any point along the curve, there are only two ways you can go. Going one way takes you away from the bottom; going the other way brings you closer to the bottom. The trick to taking a step in the right direction is, first, to calculate the slope, or gradient, at your current location. (The term “gradient” has a more specific meaning, but let’s use it here nonetheless.) In this case, for x = 1, the gradient is 2. If the function is bowl-shaped, as this one is, then the path toward the minimum involves going in a direction that decreases the gradient. So, we take a step such that the value of the x-coordinate is reduced by some step size (η) multiplied by the gradient at that point:

x_new = x_old - η · gradient

Let’s take a moment to see why taking a step that reduces the value of x reduces the gradient. For our equation, the gradient is given by 2x. So, if the new value of x is less than the old value of x, the gradient at the new location will be lower than before. The new x-coordinate gives us a new y-coordinate. We end up at a new location. We repeat the process until the gradient becomes zero or close to zero. (We have reached the bottom or near enough to it.) Here’s a graph depicting the process:

The step size, η, must be some small number, a fraction (say, 0.1). Why? Mainly because as you near the bottom, you want to be very careful you don’t overshoot the minimum and end up higher on the other side of the curve. If you do so, depending on the function, the algorithm might begin taking you higher up the curve, away from the minimum. Also, notice that though the step size is the same at each iteration of the algorithm, the size of the jumps along the curve is greater in the beginning and becomes smaller as you near the bottom. Again, why? It’s because we are subtracting a multiple of the gradient from the x-coordinate to get a new x-coordinate. The multiple, or step size, doesn’t change in our algorithm. But what’s changing is the gradient: It’s getting smaller. So, the jumps along the curve also get progressively smaller.
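
The procedure just described fits in a few lines. This is a sketch of the same descent on y = x², with the starting point x = 1 and step size η = 0.1 taken from the example:

```python
# Gradient descent on y = x^2: the gradient at any x is 2x.
x = 1.0
eta = 0.1   # step size

for _ in range(25):
    gradient = 2 * x           # slope of the curve at the current x
    x = x - eta * gradient     # x_new = x_old - eta * gradient

print(x)   # very close to 0, the bottom of the curve
```

Each step multiplies x by (1 - 2η) = 0.8, which is why the jumps along the curve shrink as we near the minimum even though η itself never changes.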

Functions like the one depicted above, which have a single, well-defined minimum, are also called convex functions. When we find the bottom of the bowl, technically, we have found the “global” minimum of the function. (If a function has multiple minima, then each of these is called a “local” minimum.)

Now consider the case when the minimization involves a function that takes two inputs. Here’s the function:

z = x² + y²

The plot shows the bowl-shaped 3D surface, called an elliptic paraboloid. If you start from any location above the bottom of the bowl, the descent along the surface of this paraboloid can be easily visualized. The difference from the 2D case is that we use the gradient at any location to calculate the new x and y coordinates, instead of just the x-coordinate. (Same operation: Subtract gradient times some step size from each coordinate.) This then gives us a new z-coordinate, and we come down to a new location on the surface. Doing this iteratively gets us to the bottom of the bowl.
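
Here is a sketch of that two-variable descent; the starting point (3, 4) is an arbitrary choice for illustration:

```python
# Gradient descent on the elliptic paraboloid z = x^2 + y^2.
x, y = 3.0, 4.0
eta = 0.1

for _ in range(50):
    grad_x, grad_y = 2 * x, 2 * y   # partial slopes of z at the current spot
    x -= eta * grad_x               # same operation on each coordinate
    y -= eta * grad_y

print(x, y, x**2 + y**2)   # all near 0: the bottom of the bowl
```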

Here’s a different function to help visualize why it may not be possible to find the minimum for certain functions.

The 3D surface is defined by this equation:

z = y² - x²

The surface is called a hyperbolic paraboloid. Notice that it looks like a saddle: part convex surface and part concave. In the figure above, we start descending from our location and ostensibly reach a place where the gradient is zero. But this is an unstable place. It’s called a saddle point. One false step, and you’ll tumble down the surface. This function has no global or local minimum. Also, the initial starting point can dictate whether you even come close to the saddle point while descending. Take this scenario, for instance:

In this case, attempting to descend the gradient by following the same technique (because you start somewhere else) can cause you to veer away from the saddle point.

All this may sound terribly abstract, but gradient descent is crucial not only to Widrow and Hoff’s algorithm, but also to modern machine learning. But before we connect gradient descent to Widrow and Hoff’s work, there’s one important detail we need to address.

Take this function again:

z = x² + y²

Recall that when we had a function with one variable (y = x²), we could use calculus to determine the derivative (dy/dx = 2x) and use this value to perform gradient descent. But what do we do when the function involves multiple variables? Well, there’s an entire field of so-called multi-variable, or multi-variate, calculus. And while it can be daunting to confront multi-variate calculus in its entirety, we can appreciate the central role it plays in machine learning by focusing on some simple ideas.

Imagine you are standing at some point on the surface of an elliptic paraboloid, z = x² + y². To figure out the direction of steepest descent, we must be concerned about two directions, given that we have two variables. Following Thompson’s exhortation to state things in simple ways, we know that moving along the surface means a small change in the value of the variable z. So, our job is to calculate ∂z/∂x and ∂z/∂y: a “tiny change in z divided by a tiny change in x” and a “tiny change in z divided by a tiny change in y,” respectively.

In calculus-speak, we are taking the partial derivative of z with respect to x, and the partial derivative of z with respect to y. For our elliptic paraboloid, the partial derivatives are: ∂z/∂x = 2x and ∂z/∂y = 2y.

Also, note the slight change in the symbol used: ∂x instead of dx and ∂y instead of dy. The curvy “d” signifies a partial derivative of a function w.r.t. (“with respect to”) one of many variables. For a conceptual understanding of what comes next, we don’t need to worry about how to derive these partial derivatives. It’s enough to know that given differentiable functions, calculus shows us how to get at these analytical expressions.

The most important concept here is that the direction of steepest descent, for this example, is given by two partial derivatives. Let’s say you are standing at a location where:

x = 3, y = 4, and z = 3² + 4² = 25

At this location, the two partial derivatives have the values:

∂z/∂x = 2x = 2 × 3 = 6

∂z/∂y = 2y = 2 × 4 = 8

If you write these numbers in this form, it looks like something very familiar: [6, 8]. It’s a vector!

So, if you have to move slightly in the direction of steepest descent, that direction can be inferred from this vector. Recall that a vector has a magnitude (or length) and a direction. In this case, our vector is an arrow going from [0, 0] to [6, 8]. This vector is called the gradient. One technical point: The gradient points away from the minimum. So, to go down toward the minimum, you must take a small step in the opposite direction or follow the negative of the gradient.

If there’s one thing to take away from this discussion, it’s this: For a multi-dimensional or high-dimensional function (meaning, a function of many variables), the gradient is given by a vector. The components of the vector are partial derivatives of that function with respect to each of the variables.
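
One way to convince yourself of this is to approximate each partial derivative numerically by making a “tiny change” in one variable at a time, exactly as described above. In this sketch (the helper function and the nudge size h are my own choices), the method recovers the gradient [6, 8] at the point (3, 4):

```python
def z(x, y):
    return x**2 + y**2

def numerical_gradient(f, x, y, h=1e-6):
    """Approximate [dz/dx, dz/dy] by nudging one variable at a time."""
    dz_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # partial w.r.t. x
    dz_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # partial w.r.t. y
    return [dz_dx, dz_dy]

print(numerical_gradient(z, 3.0, 4.0))   # very close to [6.0, 8.0]
```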

For our elliptic paraboloid, the gradient is written as:

[∂z/∂x, ∂z/∂y] = [2x, 2y]

The gradient can be written as either a row vector or a column vector.

What we have just seen is extraordinarily powerful. If we know how to take the partial derivative of a function with respect to each of its variables, no matter how many variables or how complex the function, we can always express the gradient as a row vector or column vector. Just to illustrate the power of this approach, consider this slightly more complicated equation:

f(x, y, z) = x² + 3y³ + z⁵

The function f depends on three variables and is plotted in 4D space. There is no way for us to visualize what it looks like. And just by looking at the equation, it’s impossible to tell whether the function has a global minimum toward which we can descend. But it’s possible to write down the gradient using the partial derivatives. (Again, we are not trying to figure out how exactly to differentiate the function with respect to each variable; let’s take it that if the function can be differentiated, calculus will provide an answer. You can use Wolfram MathWorld to find these derivatives.)
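
Taking the partial derivatives on faith (2x, 9y², and 5z⁴, from standard differentiation rules), the gradient of this four-dimensional function is still just a vector of three numbers at any given point:

```python
def gradient_f(x, y, z):
    """Gradient of f(x, y, z) = x^2 + 3y^3 + z^5: one partial per variable."""
    return [2 * x, 9 * y**2, 5 * z**4]

print(gradient_f(1.0, 1.0, 1.0))   # [2.0, 9.0, 5.0]
```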

Now, given some set of values for x, y, and z, we can evaluate the gradient of the function at that point, take a small step in the opposite direction, and update the values of x, y, and z. If the function has a global minimum or local minima, iterating over this process will get us there. Our analysis has also connected the dots between two important concepts: functions on the one hand and vectors on the other. Keep this in mind. These seemingly disparate fields of mathematics—vectors, matrices, linear algebra, calculus, probability and statistics, and optimization theory (we have yet to touch upon the latter two)—will all come together as we make sense of why machines learn.

GLIMMERS OF A NEURON

Bernard Widrow came back from the 1956 AI conference at Dartmouth with, as he put it, a monkey on his back: the desire to build a machine that could think. “It’s always there,” he told me more than six decades later. “I haven’t really ever gotten that out of my system.” Yet, in 1956, a young Widrow was savvy enough to realize the futility of building thinking machines, and he turned to more practical things. Building an adaptive filter was one such pursuit.

In the field of signal processing, a filter is something that takes an input signal, processes it, and produces an output signal that has certain desired properties. Let’s say you are working on some hobby electronics equipment, and you need to measure a signal. But mixed in with your signal is an annoying hum at a frequency of 60 Hz. That’s interference from the AC mains power. A filter can take the noise-laden input, remove only the 60 Hz component, and spit out a clean signal. Such a filter is easy to design, as the noise is well understood; it’s always at 60 Hz. But often, a filter needs to learn the characteristics of the noise; it needs to adapt.

Consider an important application for such an adaptive filter: digital communications. Anyone who has ever used a dial-up modem to connect to the internet will remember the distinctive sounds made by the modem. First a dial tone, then the tones of the number being dialed, followed by beeps and bursts of staccato screeches, and then silence after about twenty seconds. That’s the sound of a handshake: two digital devices figuring out the best way to talk to each other over a phone line usually used for analog voice signals. The digital devices must transmit and receive streams of zeroes and ones. But analog transmission lines can be noisy—so, you need a filter to remove the noise that could corrupt the data. This includes canceling any echo that a modem may hear of its own transmissions. But it’s impossible to build a generic filter for such purposes. The noise can be, and often is, different in each instance of two communicating devices. Part of what happens during a handshake is that an adaptive filter at each end figures out the characteristics of the noise, which it can then remove to create an almost error-free communication channel. (Widrow recalls using a fax machine that made these “handshake” sounds when communicating with some remote fax machine; his grandson, who happened to be around him in those days, started calling the sounds of the handshake “Grandpa music.”)

One design of an adaptive filter is shown below.

Here, xn is the input signal; yn represents the corresponding output. The filter turns xn into yn. The output is compared against a desired signal dn, which is the signal the filter ought to have produced. Any discrepancy between yn and dn results in an error en.

en = dn - yn

This error en is fed back into the filter. An adaptive filter changes itself such that the error is minimized. The black box named FILTER has some characteristics, or parameters, and these parameters can be tweaked to make the filter adaptive.

You might say, if you know the desired signal, what’s the point of a filter? Well, you don’t know the desired signal for any generic input. But there are ways to know what the filter should produce for known inputs. For example, that’s what modems do during the handshake: They transmit a previously agreed-upon signal, so that the other side knows what to expect. That’s the desired signal dn. But the signal arrives over a noisy transmission line, so the input xn is simply dn contaminated by noise. But unlike the 60 Hz hum we looked at earlier, this noise is random. The receiver needs a filter that takes xn as an input and produces a signal yn that is as close to the desired signal dn as possible. To do so, the algorithm must learn the statistical properties of the noise, so that it can predict the noise at each time step and subtract it in real time from xn to produce the desired signal.

While all this is a far cry from AI and ML, we can see glimmers of machines that learn. This connection—particularly to Rosenblatt’s perceptron and artificial neurons—will become even more obvious when we write down the particulars of a filter.

This was slowly becoming obvious to Widrow, too, while he was still at MIT, where he was deeply influenced by the doyen of filter design, Norbert Wiener. At the time, Wiener was MIT’s best-known professor. Decades later, Widrow, recalling Wiener’s personality in a book, painted a particularly evocative picture of a man whose head was often, literally and metaphorically, “in the clouds” as he walked the corridors of MIT buildings: “We’d see him there every day, and he always had a cigar. He’d be walking down the hallway, puffing on the cigar, and the cigar was at angle theta—45 degrees above the ground. And he never looked where he was walking…But he’d be puffing away, his head encompassed in a cloud of smoke, and he was just in oblivion. Of course, he was deriving equations.” Even as he approached the steps at the end of some hallway, Wiener would be looking up, not down. “You can see he’s going to kill himself—he’s going to fall down those steps—but if you disturb him, you might break his train of thought and set science back like ten years! There was always that problem.”

Such life-and-death decisions notwithstanding, Widrow embraced Wiener’s work. While at MIT, he even came up with different versions of the adaptive filter. Here’s an example of one of his designs:

In the filter, the input signal xn arrives discretely, one sample per time step (the step can be anything: a day, a second, a millisecond, a microsecond, and so on), and yn is the corresponding output. Each box labeled DELAY takes a signal and delays it by one time step, producing the signal xn-1 from xn and xn-2 from xn-1. After one delay, the signal is multiplied by a weight w1; after two delays, it is multiplied by w2; and so on. The undelayed signal is multiplied by w0. All these products are summed. So, for our example on the preceding page, the output signal yn can be written as:

yn = w0.xn + w1.xn-1 + w2.xn-2

We can treat [w0, w1, w2] as the vector w, and [xn, xn-1, xn-2] as the vector xn. Then,

w.xn = w0.xn + w1.xn-1 + w2.xn-2
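
As a minimal sketch (Python, with made-up weights and input samples), the output at each time step is just this dot product of the weight vector with the current and delayed inputs:

```python
# A toy version of the filter above: y_n is the dot product of the
# weights with the current input and its delayed copies. The weights
# and signal values here are made up for illustration.

def filter_output(weights, x, n):
    """Compute y_n = w0*x_n + w1*x_(n-1) + w2*x_(n-2) + ...
    Inputs from before the start of the signal are treated as 0."""
    return sum(w * (x[n - k] if n - k >= 0 else 0.0)
               for k, w in enumerate(weights))

w = [0.5, 0.3, 0.2]          # w0, w1, w2
x = [1.0, 2.0, 3.0, 4.0]     # input samples x0, x1, x2, x3

y3 = filter_output(w, x, 3)  # 0.5*4 + 0.3*3 + 0.2*2 ≈ 3.3
print(y3)
```

Making such a filter adaptive then amounts to adjusting the entries of `w` whenever the output differs from the desired signal.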

The diagram shows only two delays, but in principle, there can be any number of them. Now, if dn is the desired signal, here’s how you would go about optimizing the filter’s parameters, to minimize the error between what it generates, which is yn, and the desired signal, dn.

yn = w.xn, where:

xn = [xn, xn-1, …]

and w = [w0, w1, …]

en = dn - yn

en = dn - w.xn

What we have is an expression for the error that the filter makes at the nth time step. It’s clear that if the filter predicts a good approximation of the desired signal, then the error will be minimized. To achieve this, the filter must learn the value for w at each time step. Of course, such a filter can update its parameters whenever it gets the prediction wrong—hence the name “adaptive filter.” It learns. Ideally, over time, the average error made by the filter should tend toward zero. (Maybe now the connections to machine learning are beginning to emerge from this fog of filter theory.)

A small ML digression: How should we calculate the average error? Adding the errors to calculate the mean won’t suffice; as we saw earlier, negative and positive errors can negate each other, giving the false impression that the average error is low. We could add the absolute value of the errors and take the average: this is called the mean absolute error (MAE). But the math folk prefer to take the average of the square of the error terms and call it the mean squared error (MSE). It turns out that the MSE has some nice statistical properties that the MAE doesn’t. Also, the MSE is differentiable everywhere. (A differentiable function is one that has a derivative everywhere in its domain, where a domain could be, say, the xy plane.) The MAE is not. This, too, helps immensely, and we’ll see this when we come to training neural networks. One more fact worth mentioning: If you want your error estimate to punish extreme outliers, then the MSE does that better than the MAE, because the contribution of an error to the average increases as the square of the error in the MSE, while it increases linearly in the MAE.
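
A quick numeric illustration of these points, with made-up error values: the plain mean lets errors cancel, and the MSE punishes a single outlier far more than the MAE does:

```python
# Made-up errors: four small ones that cancel in pairs, plus one outlier.
errors = [1.0, -1.0, 1.0, -1.0, 10.0]

mae = sum(abs(e) for e in errors) / len(errors)   # (1+1+1+1+10)/5 = 2.8
mse = sum(e * e for e in errors) / len(errors)    # (1+1+1+1+100)/5 = 20.8

# The naive mean lets positive and negative errors cancel:
mean = sum(errors) / len(errors)                  # 10/5 = 2.0

print(mean, mae, mse)
```

The outlier of 10 contributes 10 to the MAE's sum but 100 to the MSE's, which is why the MSE is so much larger here.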

Back to the filter. We square the error at each time step, add all the squared errors, and then find the expected value. The expected value, or “expectation,” of something that is varying randomly has a very specific meaning in probability theory, but let’s not worry about that. The key insight here is that we need to minimize the expected value (E) of the squared errors. Let’s call that value J:

J = E(en²)

⇒ J = E((dn - yn)²)

⇒ J = E((dn - w.xn)²)

The value of J must be minimized. If you look at the form of the equation that relates J to the filter parameter w, it becomes clear that the function that connects the two will be quadratic (meaning, it’ll involve the second power of w). We already saw that such quadratic functions are convex (y = x², or z = x² + y², for example). So, when J is minimized, we end up at the bottom of some bowl-shaped function. At this point, the slope, or gradient, of J is zero. This gives us another way of finding the optimal value for w. We can simply set the value of the gradient of J with respect to w to zero and solve the equation:

In 1931, Wiener and German mathematician Eberhard Hopf devised a way to solve such equations, using techniques from linear algebra. But this requires some a priori knowledge about the correlation between the inputs at all the various time steps and the correlation between the inputs and the desired outputs. This is not always known, and even when it is, the calculations can be computationally intensive. Also, Wiener’s work applied to analog filters.

We can also minimize J by using the method of steepest descent. Why is that? Well, because it’s a bowl-shaped, convex function, we can always find the value for w that minimizes the expectation value of the squared errors by iteratively following a path down to the bottom of the bowl. So, regardless of whether the filter is characterized by one coefficient (w0), two (w0, w1), three (w0, w1, w2), or more, the assertion holds. Steepest descent will let you find the minimum. But this method, too, has a limitation: We need to be able to calculate the partial derivatives of J with respect to the filter coefficients.
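
As a minimal illustration (the function and step size here are made up, not from the text), steepest descent on a one-dimensional quadratic bowl walks w down to the bottom by repeatedly stepping against the gradient:

```python
# Minimize J(w) = (w - 3)^2, a bowl whose bottom sits at w = 3.
# The partial derivative is dJ/dw = 2*(w - 3); steepest descent
# steps in the opposite direction of this gradient.

def grad(w):
    return 2 * (w - 3.0)

w = 0.0       # arbitrary starting point
mu = 0.1      # step size

for _ in range(100):
    w = w - mu * grad(w)   # step against the gradient

print(w)  # very close to 3.0, the bottom of the bowl
```

With more weights, the same loop runs over a vector of partial derivatives, one per coefficient, which is exactly the limitation noted above: you must be able to compute each of those derivatives.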

There are other computational concerns.

For instance, given xn and the corresponding yn, plus the desired output dn, one can use the method of steepest descent to calculate the parameters (in our example: w0, w1, w2). The trouble is that to find the optimal values for the parameters, you need more and more samples of input, output, and desired output, and these calculations take increasingly longer to finish.

In addition, given that the error calculated for a certain sample of data doesn’t fully represent all possible errors, the gradient you calculate at each time step to go toward the minimum is only an approximation. Sometimes it’s pointing in the right direction, but most times it’s not. In our analogy of walking down a terraced hillside to the village, it’s as if in addition to having to navigate in the dark, you were somewhat drunk. Instead of taking the steepest path down to the next terrace, you stagger down willy-nilly. You might even clamber up to the next terrace. The hope is that if you take small enough steps, even this drunkard’s walk will get you down to the village. And in practice, algorithms that do this indeed succeed. This method is called stochastic gradient descent (SGD), where the word “stochastic” refers to the fact that the direction of each step in your descent is slightly random.

This is what Widrow was working on while at MIT, before he moved to Stanford. But alongside filters, he was thinking about adaptive neurons, or neural elements, and realizing that training a neuron was no different from training a filter.

A WEEKEND WITH BERNIE

When Ted Hoff walked into Widrow’s office at Stanford that fateful day in the fall of 1959, Widrow began discussing such ideas with him. “So, I was up at the blackboard explaining to Ted about the stochastic gradient and quadratic bowl…and adaptive filters and adaptive neural elements and…talking about how you differentiate to get the components of the gradient,” Widrow told me. “I don’t know how it happened, but we came up with an idea that we could get a stochastic gradient, a very crude gradient algebraically—without differentiating anything, without averaging anything, and without squaring anything.”

The technique they devised could be applied to adaptive filters or artificial neurons. So far, we have learned that the output of our example adaptive filter is given by:

yn = w0.xn + w1.xn-1 + w2.xn-2

or, yn = w.xn

Designing a filter that adapts involves learning the values for w0, w1, and w2. If you cast your mind back to Rosenblatt’s perceptron, you’ll see that it, too, involves learning the weights so that it can correctly classify a new piece of data as falling on one or the other side of a hyperplane. Rosenblatt’s algorithm isn’t cast in terms of gradient descent. But the Widrow and Hoff algorithm is.

The figure below shows a way to think about the adaptive neuron designed by Widrow and Hoff.

The neuron produces an output y:

y = w0x0 + w1x1 + w2x2

Here, x0 is always 1; this makes w0 our bias term, b. The actual inputs are x1 and x2. Together, they make up the vector x. The set of coefficients w0, w1, and w2 is the vector w. So:

y = w.x

⇒ y = wᵀx

Assume you have several training samples for which you have the inputs and the corresponding desired outputs (d). Then, the error made by the adaptive neuron for each input is given by:

error (e) = d - y = d - wᵀx

Consider the problem where the input is a set of 16 values, representing a 4×4 grid of pixels. These pixels can be used to show letters of the alphabet. The letter “T,” for example, would light up some of those pixels (meaning, some pixels would have the value “1,” and others “0”). The letter “J” would light up a different set of pixels.

Let’s say that when the set of pixels representing “T” is the input to the neuron, it must output the value 1. And when the input is the set of pixel values representing the letter “J,” the neuron must output -1. So, the desired output for “T” is 1, and for “J,” it’s -1.

Training the neuron involves supplying it one input, representing one letter, at a time. The algorithm uses the input and the desired output to adjust its weights and generate the correct output. But changing the weights so that you get the correct output for the input letter “T” may cause it to make an error for the input letter “J.” If that’s the case, the algorithm adjusts its weights again. Of course, now the new weights may cause an error for the input letter “T.” You repeat the process. And this goes on until the neuron correctly outputs 1 for the letter “T” and -1 for the letter “J.” The method of steepest descent can be used to train the neuron.

Let’s say you had a bunch of training samples: inputs and their corresponding outputs. If you calculate the errors made by the neuron for all input samples and plot the expectation value of squared errors as a function of all the weights, or coefficients, you get a bowl-shaped function (of course, in a higher-dimensional space that we cannot visualize). Then, you could minimize the expectation value using the method of steepest descent. At each step, you calculate the gradient of the function with respect to each weight and then modify the weights by taking a small step in the opposite direction (toward the minimum).

wnew = wold + μ(-Δ)

where:

μ = step size

Δ = gradient

Recall from our previous discussion that the gradient is simply a vector in which each element is the partial derivative of the mean squared error, J, with respect to each weight.

So, for our three weights, the gradient is:

Δ = [∂J/∂w0, ∂J/∂w1, ∂J/∂w2]

Each element of this vector will be an analytic expression that can be calculated using the rules of calculus. Once you have the expressions, you just plug in the current values for the weights, and you get the gradient, which you can then use to calculate the new weights. The problem: You need calculus, and while our gradient has only three elements, in practice, it can have elements that number in the tens, hundreds, thousands, or even more. Widrow and Hoff were after something simpler. This is what they came up with:

wnew = wold + μ(-Δest)

Instead of calculating the entire gradient, they decided to calculate only an estimate of it. The estimate would be based on just one data point. It didn’t involve calculating the expectation value of the error squared. Rather, they were simply estimating it. But estimating a statistical parameter based on just one sample is usually anathema. Even so, Widrow and Hoff went with it. With a little bit of analysis, they came up with their update rule for the weights:

wnew = wold + 2μεx

where:

μ = step size

ε = error based on one data point

x = the vector representing a single data point

The error itself is given by:

ε = d - wᵀx

This is simple algebra. Basically, for each input, you calculate the error and use that to update the weights.
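
As a sketch of this update rule in Python (the "true" weights, step size, and random inputs are all made up, and the target here is noise-free for simplicity), a toy adaptive element learns its weights one sample at a time using exactly wnew = wold + 2μεx:

```python
import random

random.seed(0)

# A made-up "true" weight vector the element should discover.
true_w = [0.5, -0.3, 0.8]

w = [0.0, 0.0, 0.0]   # start with all weights at zero
mu = 0.01             # step size

for _ in range(5000):
    x = [random.uniform(-1, 1) for _ in range(3)]          # one input sample
    d = sum(wi * xi for wi, xi in zip(true_w, x))          # desired output
    y = sum(wi * xi for wi, xi in zip(w, x))               # element's output
    eps = d - y                                            # error, one sample
    w = [wi + 2 * mu * eps * xi for wi, xi in zip(w, x)]   # LMS update

print(w)  # close to [0.5, -0.3, 0.8]
```

No derivatives, no squaring, no averaging: each tiny step uses only the single-sample error ε and the input x, and the many small steps average out the noise in the gradient estimate.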

Widrow and Hoff were aware that their method was extremely approximate. “What you do is you take the single value of the error, square it, swallow hard, because you are going to tell a lie, [and] you say that’s the mean squared error,” Widrow told me. “It’s a pretty noisy version of the average of the square of the error. And then, when you take the derivatives, you can do it analytically, without differentiating. You don’t have to square anything. You don’t have to average anything. You got an extremely noisy gradient. You take a small step, another small step, another small step.”

And yet, the algorithm gets you close to the minimum of the function. It came to be called the least mean squares (LMS) algorithm. In a video he uploaded in 2012 to explain the algorithm, Widrow credited one of his graduate students with naming it, though he couldn’t remember the student’s name. He also said, “I hope that all this algebra didn’t create too much mystery. It’s all quite simple once you get used to it. But unless you see the algebra, you would never believe that these algorithms could actually work. Funny thing is they do. The LMS algorithm is used in adaptive filters. These are digital filters that are trainable…Every modem in the world uses some form of the LMS algorithm. So, this is the most widely used adaptive algorithm on the planet.”

Not only would the LMS algorithm find uses in signal processing, but it would also become the first algorithm for training an artificial neuron that used an approximation of the method of steepest descent. To put this into context: Every deep neural network today—with millions, billions, possibly trillions of weights—uses some form of gradient descent for training. It would be a long road from the LMS algorithm to the modern algorithms that power AI, but Widrow and Hoff had laid one of the first paving stones.

On that Friday afternoon in the autumn of 1959, however, all they had were mathematically motivated scribbles on a blackboard. Widrow and Hoff didn’t know that the algorithm would work. They needed to simulate it on a computer, but they were excited: It seemed they had uncovered something extremely important. “Foolishly I was thinking: ‘We’ve discovered the secret of life,’ ” Widrow told me.

Across the hall from his office was an analog computer, a gift to Stanford from Lockheed. The door was open, and anyone could use the computer. Programming it was akin to operating an old-fashioned telephone switchboard: Take a wire out from a patch panel here, plug it in there, and so on. In a half hour, Hoff had the algorithm running on the analog machine. “He made it work,” Widrow said. “I don’t know how he knew how to do it. He knew how to program that thing.”

Having verified that the algorithm worked, the two had as their next step the building of a single adaptive neuron—an actual hardware neuron. But it was late afternoon. The Stanford supply room was closed for the weekend. “Well, we weren’t going to wait,” Widrow told me. The next morning, the two of them walked over to Zack Electronics, in downtown Palo Alto, and bought all the parts they needed. They then went over to Hoff’s apartment and worked all of Saturday and most of Sunday morning. By Sunday afternoon, they had it working. “Monday morning, I had it sitting on my desk,” Widrow recalled. “I could invite people in and show them a machine that learns. We called it ADALINE—‘adaptive linear neuron.’ It was…not an adaptive filter, but an adaptive neuron that learned to be a good neuron.”

What ADALINE does, using the LMS algorithm, is to separate an input space (say, the 16-dimensional space defined by 4×4, or 16, pixels) into two regions. In one region are 16-dimensional vectors, or points that represent, say, the letter “T.” In another region are vectors that represent the letter “J.” Widrow and Hoff chose 4×4 pixels to represent letters, as this was big enough to clearly show different letters, but small enough to work with, given that they had to adjust the weights by hand (using knobs). Anything larger, and they’d have spent most of their time twiddling those knobs. Again, here are the letters “T” and “J” in 4×4-pixel space:

So, each letter is represented by 16 binary digits, each of which can be either 0 or 1. If you were to imagine plotting these letters as points in a 16D space, then “J” would be a point (vector) in one part of the coordinate space, and “T” in another. The LMS algorithm helps ADALINE find the weights that represent the linearly separating hyperplane—in this case, a plane in fifteen dimensions—that divides the input space into two. It’s exactly what Rosenblatt’s perceptron does, using a different algorithm.
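
The pixel patterns below are made up for illustration (the book's figure isn't reproduced here), but the training loop follows the LMS update described above: an ADALINE-style element with 16 pixel weights plus a bias learns to output 1 for "T" and -1 for "J":

```python
# Made-up 4x4 pixel patterns for "T" and "J", flattened to 16 values.
T = [1, 1, 1, 1,
     0, 0, 1, 0,
     0, 0, 1, 0,
     0, 0, 1, 0]
J = [0, 0, 0, 1,
     0, 0, 0, 1,
     1, 0, 0, 1,
     0, 1, 1, 0]

samples = [(T, 1.0), (J, -1.0)]   # desired outputs: +1 for T, -1 for J

w = [0.0] * 17                    # 16 pixel weights + bias weight w0
mu = 0.05                         # step size (made up)

for _ in range(200):
    for pixels, d in samples:
        x = [1.0] + [float(p) for p in pixels]   # x0 = 1 feeds the bias
        y = sum(wi * xi for wi, xi in zip(w, x))
        eps = d - y
        w = [wi + 2 * mu * eps * xi for wi, xi in zip(w, x)]   # LMS update

def classify(pixels):
    x = [1.0] + [float(p) for p in pixels]
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

print(classify(T), classify(J))  # 1 -1
```

The learned weights define the separating hyperplane: inputs on one side of it produce a positive output ("T"), inputs on the other side a negative one ("J").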

While the perceptron convergence proof we saw in chapter 2 showed clearly why the perceptron finds the linearly separating hyperplane, if one exists, it wasn’t exactly clear why the rough-and-ready LMS algorithm worked. Years later, Widrow was waiting for a flight in Newark, New Jersey. He had a United Airlines ticket. “Those days, your ticket was in a jacket. And there was some blank space on it. So, I sat down and started doing some algebra and said, ‘Goddamn, this thing is an unbiased estimate.’ ”

He was able to show that the LMS algorithm, if you took extremely small steps, got you to the answer: the optimal value for the weights of either the neuron or the adaptive filter. “By making the steps small, having a lot of them, we are getting an averaging effect that takes you down to the bottom of the bowl,” Widrow said.

Hoff finished his Ph.D. with Widrow and was doing his postdoctoral studies when a small Silicon Valley start-up came calling. Widrow told him to take the job. It was sound advice: The start-up was Intel. Hoff went on to become one of the key people behind the development of the company’s first general-purpose microprocessor, the Intel 4004.

Widrow continued using the LMS algorithm to build adaptive filters—for noise cancellation and for antennas that adapt to remove noise and interference. He worked on ADALINE (a single layer of adaptive neurons) and MADALINE (for “Many ADALINE”), which had three layers: input, hidden, and output. But it was hard to train MADALINE. Still, Widrow’s work began making waves.

In a 1963 episode of Science in Action titled “Computers that Learn,” produced by the California Academy of Sciences, the host and presenter, Earl S. Herald, introduces a robotic assembly that seems to be balancing a broom. “This may not seem very startling, because anyone can balance a broom. But this is a machine that can learn to do this…This is the story of MADALINE, the machine that in some respects thinks like a man [italics mine].” The sexism of the time rears its head again when Herald, barely two minutes into the episode, queries Widrow: “Let me ask you about the name ‘ADALINE.’ Why ‘ADALINE’? Why not a masculine name?” Widrow replies, “Well, this happens to spell ‘Adaptive Linear Neuron.’ And that’s it.”

The line connecting ADALINE to modern neural networks (which have multiple layers and are trained using an algorithm called backpropagation) is clear. “The LMS algorithm is the foundation of backprop. And backprop is the foundation of AI,” Widrow told me. “In other words, if you trace it back, this whole field of AI right now, [it] all starts with ADALINE.”

In terms of the backpropagation algorithm, this is a fair assessment. Of course, Rosenblatt’s perceptron algorithm can make similar claims. Together, Rosenblatt and Widrow laid some of the foundation stones for modern-day deep neural networks. But these weren’t the only such efforts. Other algorithms, also foundational, were being invented, and they’d rule the roost in the decades during which neural network research floundered, primarily because of Minsky and Papert’s unduly harsh assessment of its limitations. And these non-neural network approaches were establishing the governing principles for machines that learn based on, for example, probability and statistics, our next waystation.

CHAPTER 4

In All Probability

Probability deals with reasoning in the presence of uncertainty. And it’s a fraught business for the best of us. There’s no better illustration of how uncertainty messes with our minds than the Monty Hall dilemma. The problem, named after the host of the American television show Let’s Make a Deal, became a public obsession in 1990 when a reader of the Parade magazine column “Ask Marilyn” posed the following question to columnist Marilyn vos Savant:

“Suppose you are on a game show, and you’re given the choice of three doors. Behind one is a car; behind the others, goats. You pick a door, say, No. 1, and the host, who knows what’s behind the doors, opens another door, say, No. 3, which has a goat. He then says to you, ‘Do you want to pick No. 2?’ Is it to your advantage to switch your choice?” The person playing the game has a quandary. Do they switch their choice from door No. 1 to door No. 2? Is there any benefit to doing so? Would switching increase their odds of choosing the door hiding the car? Before we look at vos Savant’s answer, let’s try to tackle the problem ourselves. Here’s my intuitive answer:

Before the host opens one of the doors, the probability that a car is behind the door I’ve picked (Door No. 1) is one-third. But then the host opens Door No. 3 and reveals that there’s a goat behind it. Now there are two closed doors, and behind one of them is the car. I figure that the car is equally likely to be behind one or the other door. There’s no reason to switch my choice.

You may or may not have reasoned similarly. Kudos if you didn’t.

Here’s what vos Savant advised regarding whether you should switch your choice: “Yes; you should switch. The first door has a one-third chance of winning, but the second door has a two-thirds chance.” And she’s correct.

Americans were outraged—and not just folks untrained in the nuances of probability theory. As mathematician Anthony Lo Bello writes in an essay about this fracas, “Shortly thereafter, Savant received an avalanche of harsh rebukes from PhD’s [sic] on the faculties of several American universities, reproving her for giving, as they claimed, the wrong answer; the instructors, three of whose names were published in the issue of 2 December 1990 [of Parade magazine], argued that once the host had opened the losing third door, both the first and second doors then each had a probability of 1/2 of winning.”

Vos Savant stood her ground and provided the critics with different ways of arriving at her answer. One of her best intuitive arguments, paraphrasing her, asks you to consider a different situation. Say there are a million doors, and behind one of them is a car; all the others hide goats. You choose Door No. 1. There’s a one-in-a-million chance you are correct. The host then opens all the other doors you did not choose, except one. Now there are two unopened doors, your choice and the one the host left closed. Sure, the latter door could hide a goat. But of all the doors the host chose not to open, why did he choose that one? “You’d switch to that door pretty fast, wouldn’t you?” wrote vos Savant.

Mathematician Keith Devlin gave another take on it. Put a mental box around your choice, Door No. 1, and another box around Doors No. 2 and 3 combined. The box around Door No. 1 has a one-third probability associated with it, and the box around Doors No. 2 and 3 has a two-thirds probability associated with it, in terms of containing the car. Now the host opens one of the doors inside the bigger box to reveal a goat. The two-thirds probability of the bigger box shifts to the unopened door. To switch is the correct answer.

But here’s the thing. It’s impossible to rely on intuition to solve this problem, because it’s just as easy to conceive of it slightly differently and be convinced that you shouldn’t switch (as I was in my earlier analysis). And if you didn’t switch, you’d be in august company, and I don’t mean mine.

In his book Which Door Has the Cadillac?, the Hungarian mathematician Andrew Vázsonyi writes about the inability of another Hungarian mathematician, Paul Erdős—“arguably the most prolific mathematician of the 20th century, in terms of both the number of problems he solved and the number of problems he convinced others to tackle”—to agree that switching doors is the better option. While Erdős was visiting Vázsonyi in 1995, the two discussed the Monty Hall dilemma. When Vázsonyi told Erdős that switching increased one’s odds of winning, Erdős refused to accept the answer: “He reacted as if he had just been stung by a bee. ‘No, that is impossible,’ he said. ‘It should make no difference if you switch.’ ” Vázsonyi tried to convince him that probabilities are not static, that they change as contexts change. Erdős wasn’t swayed. Eventually, Vázsonyi used a computer program he had written to run one hundred thousand simulations of the game and showed that the host won and you lost two-thirds of the time if you didn’t switch, but the host lost and you won two-thirds of the time if you did switch. “Erdős objected that he still did not understand the reason why, but was reluctantly convinced that I was right,” Vázsonyi wrote.

Encapsulated in this story about the Monty Hall dilemma is the tale of an eternal dispute between two ways of thinking about probability: frequentist and Bayesian. The former approach, which makes use of the simulation, is what seemingly convinced Erdős. The frequentist notion of the probability of occurrence of an event (say, a coin coming up heads) is simply to divide the number of times the event occurs by the total number of trials (the total number of coin flips). When the number of trials is small, the probability of the event can be wildly off from its true value, but as the number of trials becomes very large, we get the correct measure of the probability. The following figure shows the results of ten thousand trials of the Monty Hall dilemma. (Data scientist Paul van der Laken shows how to plot the probabilities of winning if you switch and if you don’t switch. This is one version.)

You can see clearly that when the number of trials is small, the probabilities fluctuate. But they settle into the correct values as the trials go beyond about four thousand: 0.67, or two-thirds, for switching, and 0.33, or one-third, for not switching.
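A minimal frequentist simulation along these lines can be sketched in a few lines of Python (this is my own sketch, not van der Laken’s actual code; the function and variable names are illustrative):

```python
import random

def monty_hall_win_rate(trials, switch, rng):
    """Simulate the Monty Hall game and return the fraction of games won."""
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)   # door hiding the car
        pick = rng.randrange(3)  # contestant's initial choice
        # The host opens a door that is neither the pick nor the car.
        # (When the host has two goat doors to choose from, which one he
        # opens doesn't affect the win rate.)
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

rng = random.Random(0)
print(monty_hall_win_rate(10_000, switch=True, rng=rng))   # close to 0.67
print(monty_hall_win_rate(10_000, switch=False, rng=rng))  # close to 0.33
```

With ten thousand trials, the win rates settle near the two-thirds and one-third values described above.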

But simulations are not the only way of answering such questions. Another approach is to rely on Bayes’s theorem, one of the cornerstones of probability theory and, indeed, of machine learning.

TO BAYES OR NOT TO BAYES

There’s delicious irony in the uncertainty over Thomas Bayes’s year of birth. It’s been said that he was “born in 1701 with probability 0.8.” The date of his death, however, is firmly established: April 17, 1761, at Royal Tunbridge Wells in England. Two years after his death, a close friend, Richard Price, who was twenty-two years younger, presented a paper to the Royal Society on Bayes’s behalf. Bayes and Price were kindred spirits: intellectuals, dissenting ministers, and, of course, mathematicians. Price wrote a letter, with an accompanying essay, dated November 10, 1763, to his friend John Canton, and Canton read the correspondence to the Royal Society on December 23: An Essay Towards Solving a Problem in the Doctrine of Chances. Even though Price attributed the essay to Bayes, scholars have estimated that Price made a substantial contribution to its contents. Price submitted another paper on the topic to the Royal Society in 1764, this time as the sole author. Taken together, these submissions cemented the status of Bayes as the man who gave us his eponymous theorem, which birthed an entire way of thinking about probability and statistics and has now become, almost 250 years later, a formidable force in machine learning.

Bayes’s theorem gives us a way to draw conclusions, with mathematical rigor, amid uncertainty.

It’s best to understand the theorem using a concrete example. Consider a test for some disease that occurs in only about 1 in 1,000 people. Let’s say that the test is 90 percent accurate, meaning that it comes back positive nine out of ten times when the person has the disease and that it is negative nine out of ten times when the person doesn’t have the disease. So, it gives false negatives 10 percent of the time and false positives 10 percent of the time. For the sake of simplicity, the rate of true positives (the sensitivity of the test) and the rate of true negatives (the specificity) are taken to be the same in this example; in reality, they can be different. Now you take the test, and it’s positive. What’s the chance you have the disease? We assume that the subject being tested—“you” in this case—has been picked at random from the population.

Most of us would say 90 percent, because the test is accurate 9 out of 10 times. We’d be wrong. To calculate the actual probability that one has the disease given a positive test, we need to take other factors into account. For this, we can use Bayes’s theorem.

The theorem allows us to calculate the probability of a hypothesis H (you have the disease) being true, given evidence E (the test is positive).

This is written as P(H|E): the probability of H given E.

Bayes’s theorem says:
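The equation, which appears as a display in the original, reads (in standard notation, using the terms unpacked below):

```latex
P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}
```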

Let’s unpack the various terms on the right-hand side of the equation.

P(H): The probability that someone picked at random from the population has the disease. This is also called the prior probability (before taking any evidence into account). In our case, we can assume it is 11000, or 0.001, based on what’s been observed in the general population thus far.
P(E|H): The probability of the evidence given the hypothesis or, to put it simply, the probability of testing positive if you have the disease. We know this. It’s the sensitivity of the test: 0.9.
P(E): The probability of testing positive. This is the sum of the probabilities of the two different ways someone can test positive, given the background rate of the disease in the population. The first is the prior probability that one has the disease (0.001) multiplied by the probability that one tests positive given the disease (0.9), which equals 0.0009. The second is the prior probability that one doesn’t have the disease (0.999) times the probability that one tests positive despite not having the disease (0.1), which equals 0.0999.

So, P(E) = 0.0009 + 0.0999 = 0.1008

So, P(H|E) = 0.001 × 0.9 / 0.1008 = 0.0089, or a 0.89 percent chance.

That’s way lower than the 90 percent chance we intuited earlier. This final number is called the posterior probability: It’s the prior probability updated given the evidence. To get a sense of how the posterior probability changes with alterations to the accuracy of the test, or with changes in the background rate of the disease in the population, let’s look at some numbers:

For a test accuracy rate of 99 percent—only 1 in 100 tests gives a false positive or false negative—and a background rate of disease in the population of 1 in 1,000, the probability that you have the disease given a positive test rises to 0.09. That’s almost a 1-in-10 chance.

For a test accuracy rate of 99 percent (1 in 100 tests gives a false positive or false negative), and a background rate of disease in the population of 1 in 100 (the disease has become more common now), the probability that you have the disease given a positive test rises to 0.5. That’s a 50 percent chance.

Improve the test accuracy to 99.9 percent and keep the background rate at 1 in 100, and we get a posterior probability of 0.91. There’s a very high chance you have the disease if you tested positive.
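The arithmetic behind all of these scenarios can be checked with a few lines of code. This is a sketch (the function name is mine); like the text, it assumes the false positive rate equals one minus the sensitivity:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test), computed with Bayes's theorem."""
    # P(E): total probability of testing positive, via both routes.
    p_positive = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_positive

# 90% accurate test, disease rate 1 in 1,000:
print(round(posterior(0.001, 0.9, 0.1), 4))    # 0.0089
# 99% accurate test, disease rate 1 in 1,000:
print(round(posterior(0.001, 0.99, 0.01), 2))  # 0.09
# 99% accurate test, disease rate 1 in 100:
print(round(posterior(0.01, 0.99, 0.01), 2))   # 0.5
# 99.9% accurate test, disease rate 1 in 100:
print(round(posterior(0.01, 0.999, 0.001), 2)) # 0.91
```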

With this whirlwind introduction to Bayes’s theorem, we are ready to tackle the Monty Hall problem. (This is a bit involved. Feel free to skip to the end of this section if you think it’s too much, though it’s quite revealing to see how Bayes’s theorem gets us to Marilyn vos Savant’s answer.)

We start by assuming that the car is hidden at random behind one of the three doors.

Let’s start by stating our hypothesis and our priors. We pick Door No. 1. The host opens Door No. 3, behind which is a goat. We must figure out whether it’s worth switching our guess from Door No. 1 to Door No. 2, to maximize our chances of choosing the door that hides the car. To do this, we must figure out the probabilities for two hypotheses and pick the higher of the two.

The first hypothesis is: Car is behind Door No. 1, given that host has opened Door No. 3 and revealed a goat. The second hypothesis is: Car is behind Door No. 2, given that host has opened Door No. 3 and revealed a goat. Consider the probability of the first hypothesis:

P (H=car is behind Door No. 1 | E=host has opened Door No. 3, revealing a goat).

From Bayes’s theorem:
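Applying the theorem to this hypothesis (the display from the original, in standard notation):

```latex
P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}
```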

Where:

P (E|H): the probability that the host opens Door No. 3, given that the car is behind Door No. 1. At the start of the game, you picked Door No. 1. If the car is behind it, the host can see that and, hence, has a choice of two doors to open, either No. 2 or No. 3, both of which hide goats. The probability that they open No. 3 in particular is simply 1/2.

P (H): the prior probability that the car is behind Door No. 1, before any door is opened. It’s 1/3.

P (E): the probability that the host opens Door No. 3. This must be carefully evaluated, given that the host knows that you have picked Door No. 1 and they can see what’s behind each door. So,

P (host picks Door No. 3) = P1 + P2 + P3

P1 = P (car is behind Door No. 1) × P (host picks Door No. 3, given car is behind Door No. 1) = P (C1) × P (H3|C1)

P2 = P (car is behind Door No. 2) × P (host picks Door No. 3, given car is behind Door No. 2) = P (C2) × P(H3|C2)

P3 = P (car is behind Door No. 3) × P (host picks Door No. 3, given car is behind Door No. 3) = P (C3) × P(H3|C3)

Take each part of the right-hand side of the equation:

  • P1: P (C1) × P (H3|C1).
    • P (C1) = P (car is behind Door No. 1) = 1/3.
    • P (H3|C1)—if the car is behind Door No. 1, then the probability that the host opens Door No. 3 is 1/2. They could have picked either Door No. 2 or Door No. 3.
    • So, P1 = 1/3 × 1/2 = 1/6.
  • P2: P (C2) × P (H3|C2).
    • P (C2) = P (car is behind Door No. 2) = 1/3.
    • P (H3|C2)—if the car is behind Door No. 2, then the probability that the host opens Door No. 3 is 1, because they cannot pick Door No. 2; doing so would reveal the car.
    • So, P2 = 1/3 × 1 = 1/3.
  • P3: P (C3) × P (H3|C3).
    • P (C3) = P (car is behind Door No. 3) = 1/3.
    • P (H3|C3)—if the car is behind Door No. 3, then the probability that the host opens Door No. 3 is 0, because opening it would reveal the car.
    • So, P3 = 1/3 × 0 = 0.

So, P (E) = P1 + P2 + P3 = 1/6 + 1/3 + 0 = 3/6 = 1/2

We can now calculate the probability that hypothesis 1 is true, given the evidence:
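Plugging in the three terms computed above:

```latex
P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}
            = \frac{(1/2) \times (1/3)}{1/2} = \frac{1}{3}
```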

The probability that the car is behind the door you have picked is 1/3.

Now let’s calculate the probability for the second hypothesis: The car is behind Door No. 2 given that the host has opened Door No. 3, revealing a goat. We can do a similar analysis.

P (E|H): Probability that the host opens Door No. 3, given that the car is behind Door No. 2. The host cannot open Door No. 2. They have to open Door No. 3, so the probability of this event is 1.

P (H): The prior probability that the car is behind Door No. 2, before any door is opened. It’s 1/3.

P (E): As computed before, it’s ½.
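Putting these three terms together:

```latex
P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)}
            = \frac{1 \times (1/3)}{1/2} = \frac{2}{3}
```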

Very clearly, the second hypothesis—that the car is behind Door No. 2, given that the host has opened Door No. 3—has a higher probability compared to the probability that the car is behind Door No. 1 (your original choice). You should switch doors!

If all this feels counterintuitive and you still refuse to change your choice of doors, it’s understandable. Probabilities aren’t necessarily intuitive. But when machines incorporate such reasoning into the decisions they make, our intuition doesn’t get in the way.

WHO GIVES A TOSS?

Most machine learning is inherently probabilistic, even if the algorithm is not explicitly designed to be so. But that notion gets lost in the myriad claims about the capabilities of modern artificial intelligence. Let’s take the perceptron algorithm we encountered earlier. Given two sets of data that are linearly separable, the algorithm will find a hyperplane that can divide the data that exists in some coordinate space into two. As we’ll see in more detail soon, the algorithm finds a hyperplane, not necessarily the best one (for some definition of “best”). Then, given a new instance of data, the algorithm checks to see whether the data point lands on one side of the hyperplane or the other and classifies that data point accordingly. Sounds pretty deterministic, doesn’t it? What’s probabilistic about the perceptron?

The prediction is probabilistic in the sense that there’s a finite chance that the algorithm will make an error when classifying a new data point. The classification depends entirely upon the hyperplane found by the perceptron. An infinite number of hyperplanes can separate the original data into two. So, one hyperplane might classify a new data point as belonging to class A, while another might classify the same data point as belonging to class B. It’s possible, mathematically, to derive the risk of error inherent in a perceptron’s predictions. That’s what makes the prediction probabilistic, even though the algorithm’s output is rather black and white (or -1 and 1).

This brings us to a valuable way of thinking about machine learning: in terms of probabilities, distributions, and statistics. To get there, we need a crash course in the basics behind these ideas.

We start with the notion of an experiment. An experiment can be anything: the toss of one coin, or two consecutive tosses of a coin, or the measurement of the temperature outside. All experiments have outcomes, and depending on how you define an experiment, you get a different set of outcomes. In each of these experiments, we can assign a number to that outcome, and this number is known as a random variable. Let’s call it X (you can call it anything). Here are examples of how X can take on numerical values:

For one toss of a coin:

X = 0 (heads) or 1 (tails)

For two consecutive tosses of a coin:

For the measurement of temperature:

X = a real-valued number between absolute zero (-273°C) and ∞

Let’s toss our single coin, say, ten times. For each toss, the random variable X can be either 0 or 1. We plot the value of X on the x-axis, against the number of trials on the y-axis, to get a bar chart. For example, something like this (6 heads and 4 tails):

Given the relatively small number of trials, the number of heads versus the number of tails can vary a lot. We can plot the same data in terms of the experimental probability of X being 0 or X being 1. The use of the adjective “experimental” before the word “probability” signifies that it’s an empirical result. Experimental, or empirical, probability is somewhat different from theoretical probability.

The theoretical probability of getting heads on a single coin toss is simply one-half, but the empirical probability depends upon the outcomes of our actual experiments. I’m going to avoid the adjective “empirical” or “theoretical” in the discussions that follow unless the type of probability isn’t clear from the context.

For our experiment with 10 coin tosses, the probability of X = 0 is 0.6, and the probability of X = 1 is 0.4. The total probability should add up to 1.

What we have just seen is the simplest example of a probability distribution. It’s called a Bernoulli distribution, and it dictates the way the values of a “discrete” random variable X are distributed. In this case, X can take on only discrete values, 0 or 1. Formally, the Bernoulli probability distribution is specified by the function P(X):
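The function, shown as a display in the original, can be written as:

```latex
P(X = x) =
\begin{cases}
p & \text{if } x = 1 \\
1 - p & \text{if } x = 0
\end{cases}
```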

P(X) is also called the probability mass function, and it states that the probability of the random variable X taking on the value 1 is p, and the probability of X being 0 is (1-p). For a fair coin, one that is equally likely to come up heads or tails, p equals 0.5.

This prepares us for the idea of sampling from an underlying distribution. An underlying distribution is the ground truth for some random variable—in this case, the outcome of the toss of a fair coin. Every toss of the coin is a sample from the underlying distribution. Here’s an example of what happens when we sample 100,000 times. (Thanks to something called the square root law, the counts of heads and tails will differ by a value that’s on the order of the square root of the total number of trials; in this case, it will be on the order of the square root of 100,000, or about 316.)
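Such a sampling run is easy to sketch (the seed and variable names here are my own, for reproducibility):

```python
import random

rng = random.Random(42)
n = 100_000
# Each trial is one draw from the underlying Bernoulli(0.5) distribution.
heads = sum(rng.random() < 0.5 for _ in range(n))
tails = n - heads
print(heads, tails)
# Square root law: the counts differ on the order of sqrt(n), about 316.
print(abs(heads - tails))
```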

In machine learning, we start with data. The distribution of the data we have in hand is representative of some underlying distribution of the data. So, if all we had were the results of 100,000 coin tosses, then that distribution of heads and tails would be our best insight into the underlying distribution describing the toss of a fair coin. Just hold that thought in mind while we take one more example of a discrete distribution.

Consider a weird digital display that, when you press a button, shows a number between 0 and 6. Here, the random variable X is the number that is displayed. So, X can be one of [0, 1, 2, 3, 4, 5, 6]. But the display is rigged. The probabilities for the different values of X are not equal. Let’s say the underlying distribution is given by:

These are the theoretical probabilities.

If we pushed the button 1,000 times, we would be sampling the underlying distribution 1,000 times, and we might get this distribution of the observed values of X and the corresponding empirical probabilities:

Given this distribution, there are some statistical parameters of interest. One is something called the expected value. We encountered it while discussing Bernard Widrow’s LMS algorithm and put off an explanation. Well, you get the expected value of X by taking each value of X and multiplying it by its probability and summing them all up. So, for our experiment:

E(X) = 0 × P(0) + 1 × P(1) + 2 × P(2) + 3 × P(3) + 4 × P(4) + 5 × P(5) + 6 × P(6)

E(X) = 0 × 0.032 + 1 × 0.056 + 2 × 0.116 + 3 × 0.584 + 4 × 0.127 + 5 × 0.056 + 6 × 0.029

E(X) = 3

This is the value you’d expect to get, on average, for the random variable X over some large number of trials. We know this by another name: the mean of the distribution. If the probabilities of all the values of X were identical, then the expected value, or mean of the distribution, would be the same as the arithmetic average. (Sum up all the values and divide by the total number of values.)

Now come two other extremely important statistical parameters: the variance and the standard deviation. First, the variance:
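The defining formula, which appears as a display in the original, can be reconstructed from the description that follows:

```latex
\mathrm{Var}(X) = \sum_{x} \bigl(x - E(X)\bigr)^2 \, P(x)
```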

Basically, take each value of X, subtract from it the expected value of X, square it, multiply the result with the probability of that value of X, and sum over all values of X. That’s the variance.

The standard deviation is defined as the square root of the variance. In standard terminology:
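That is:

```latex
\sigma = \sqrt{\mathrm{Var}(X)}, \qquad \mathrm{Var}(X) = \sigma^2
```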

Both the variance and the standard deviation are indicative of the dispersion, or spread, of X about the mean.
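These definitions are easy to check numerically against the rigged-display experiment, using the empirical probabilities from the expected-value calculation above (a sketch; the variable names are mine):

```python
# Empirical probabilities for the rigged display, X in 0..6.
probs = {0: 0.032, 1: 0.056, 2: 0.116, 3: 0.584, 4: 0.127, 5: 0.056, 6: 0.029}

mean = sum(x * p for x, p in probs.items())  # the expected value E(X)
variance = sum(p * (x - mean) ** 2 for x, p in probs.items())
std_dev = variance ** 0.5

print(round(mean, 3))     # 3.002, i.e., about 3
print(round(std_dev, 2))  # the spread of X about the mean
```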

We will change tack now. What if the random variable took on continuous values, instead of discrete ones? There’s folklore that one of the most common distributions in nature for continuous random variables is the so-called normal distribution, with the familiar bell-shaped curve. I thought so, too, until Philip Stark, a professor at the University of California, Berkeley, and an expert on probability and statistics, disabused me of the notion. “The joke is that theoreticians think it’s an empirical fact, and experimentalists think it’s a theoretical fact,” he told me, while acknowledging he couldn’t recall who first came up with the quip. “The normal approximation works when it works. The problem is knowing when it works.” Oh, well. It was a nice delusion while it lasted. Still, it’s well worth taking a closer look at the normal distribution, given the outsize role it plays in machine learning.

Consider a random variable that’s equal to a person’s average resting body temperature, in Fahrenheit. We know that our body temperature varies from day to day, from hour to hour; not by much, unless you are ill, but it varies nonetheless. Yet, we all have an average resting body temperature, which, it turns out, is slightly different for each one of us. Let’s plot the average resting body temperature of a very large sample of individuals (see figure, below).

The x-axis is the value of the random variable X, which equals the average resting body temperature in degrees Fahrenheit (F). For now, let’s take the y-axis as a measure of the empirical probability that X takes on some value between 95 and 102. (There’s a more precise meaning for what the y-axis stands for, but we’ll come to that.)

This idealized curve peaks at 98.25°F and is symmetric about it. This value is the mean of the distribution, and the curve is called a “normal,” or Gaussian, distribution.

Our curve has a standard deviation of 0.73, a parameter that has a very specific meaning in the context of a normal distribution. If you draw two vertical lines one standard deviation to the left of the mean (98.25 - 0.73) and one standard deviation to the right of the mean (98.25 + 0.73), you get a region under the curve that contains 68 percent of the observed values of X. The region is shown in gray in the following image (see figure, below). And almost all the observed values of X will lie within 3 standard deviations of the mean.
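As a quick sanity check of the 68 percent figure, we can sample from a normal distribution with this mean and standard deviation and count how many draws land within one (and three) standard deviations of the mean. A minimal sketch in Python; the sample size and seed are arbitrary choices, not anything from the text:

```python
import math
import random

# Empirical check of the 68 percent rule for a normal distribution,
# using the body-temperature numbers from the text: mean 98.25 F, SD 0.73 F.
random.seed(0)
mean, sd = 98.25, 0.73

samples = [random.gauss(mean, sd) for _ in range(100_000)]

within_1sd = sum(mean - sd <= t <= mean + sd for t in samples) / len(samples)
within_3sd = sum(mean - 3 * sd <= t <= mean + 3 * sd for t in samples) / len(samples)

print(f"within 1 SD: {within_1sd:.3f}")   # close to 0.683
print(f"within 3 SD: {within_3sd:.4f}")   # close to 0.9973
```

The exact fractions wobble from run to run, but with 100,000 draws they sit very close to the theoretical 68.3 percent and 99.73 percent.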

再次强调,方差只是标准差的平方。因此,方差和标准差告诉你值偏离平均值的分散程度。较大的标准差会给你一个更宽、更矮的图形。(如果平均值为零,标准差等于 1,我们就有所谓的“标准”正态分布。)

Again, the variance is simply the square of the standard deviation. So, the variance and standard deviation are telling you something about the spread of values away from the mean. A larger standard deviation would give you a broader, squatter plot. (If the mean is zero and the standard deviation equals 1, we have what’s called a “standard” normal distribution.)

让我们重新看一下连续随机变量图中的 y 轴。离散随机变量的特征在于其概率质量函数(PMF),连续随机变量的特征在于所谓的概率密度函数(PDF)。

Let’s revisit the y-axis in our plot of the continuous random variable. While a discrete random variable is characterized by its probability mass function (PMF), a continuous random variable is characterized by what’s called the probability density function (PDF).

对于离散随机变量,可以使用 PMF 来确定 X 具有特定值的概率。(在我们上面的一个实验中,X = 0(即正面)的概率为 0.50279。)但对于连续随机变量,我们无法做类似的事情。当变量的值是连续的时,这意味着您可以越来越精确地获得其确切值。假设记录体温的温度计是任意精确的。这意味着温度计读数可以是 95 到 102 之间无数个可能值中的一个。因此,即使可以使用正态分布来表征随机变量,它具有某个特定的、无限精确的值的概率实际上为零。

In the case of a discrete random variable, one can use a PMF to determine the probability that X has a certain value. (In one of our experiments above, the probability that X = 0, for heads, is 0.50279.) But one cannot do something similar for a continuous random variable. When the variable’s value is continuous, it means that you can keep getting more and more precise about its exact value. Let’s say the thermometer recording body temperature is arbitrarily precise. That implies the thermometer reading can be one of an infinite number of possible values between 95 and 102. So, even though a normal distribution can be used to characterize the random variable, the probability that it has some specific, infinitely precise value is actually zero.

这就是为什么在处理概率密度函数和连续随机变量时,人们只能用随机变量取两个数字之间的值的概率来表达(例如,介于 98.25 和 98.5 之间)。然后,随机变量取该范围内的值的概率由概率密度函数下的面积给出,该面积由该范围的端点界定。此外,由于总概率必须加起来为 1,因此整个 PDF 下的面积等于 1。

That’s why when dealing with probability density functions and continuous random variables, one can speak in terms only of the probability that the random variable will take on a value between two numbers (say, between 98.25 and 98.5). Then, the probability that the random variable has a value in that range is given by the area under the probability density function, bounded by the end points of that range. Also, because the total probability must add up to 1, the area under the entire PDF equals 1.
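This area-under-the-curve idea is easy to verify numerically. The sketch below approximates the integral of the normal PDF (mean 98.25, SD 0.73) over the range 98.25 to 98.5 with a simple midpoint sum; the step count is an arbitrary choice:

```python
import math

# Probability as area under the normal PDF, estimated by numerical integration.
mean, sd = 98.25, 0.73

def normal_pdf(x):
    # The Gaussian density with the mean and SD from the text.
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def area(lo, hi, steps=10_000):
    # Simple midpoint (Riemann) sum approximating the integral of the PDF.
    dx = (hi - lo) / steps
    return sum(normal_pdf(lo + (i + 0.5) * dx) for i in range(steps)) * dx

p_range = area(98.25, 98.5)               # P(98.25 <= X <= 98.5), about 0.134
p_total = area(mean - 8 * sd, mean + 8 * sd)  # effectively the whole curve

print(round(p_range, 3))
print(round(p_total, 3))                  # very close to 1.0
```

The probability of any single exact value corresponds to an interval of width zero, which is why it comes out to zero, while the area under the entire curve comes out to 1.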

到目前为止,对于我们的目的而言,关键信息是:无论是离散随机变量的概率质量函数还是连续随机变量的概率密度函数,一些具有特征参数的众所周知且分析上易于理解的函数都可用于描述概率分布。例如,对于伯努利分布,我们只需要概率p。对于正态分布,我们需要均值和方差;这两个数字给出了钟形曲线的精确形状。这些参数是建模某些基础分布的一种方法。

The key message so far, for our purposes, is this: Whether it’s a probability mass function for a discrete random variable or a probability density function for a continuous random variable, some well-known and analytically well-understood functions, with characteristic parameters, can be used to describe the probability distributions. For example, we just need the probability p for the Bernoulli distribution. For the normal distribution, we need the mean and variance; those two numbers give you the exact shape of the bell curve. These parameters are a way to model some underlying distribution.

掌握了这些概率和统计学的基本知识后,我们可以重新将机器学习视为概率推理和统计学习。

With these barest of bare-minimum basics of probability and statistics in hand, we can get back to thinking about machine learning as probabilistic reasoning and statistical learning.

半斤八两

SIX OF ONE, HALF A DOZEN OF THE OTHER

让我们从最常见的机器学习形式开始,我们已经遇到过这种形式,称为监督学习。我们给出一些标记数据X 。 X的每个实例都是一个d维向量,这意味着它有d 个分量。因此,X是一个矩阵,其中矩阵的每一行都是数据的一个实例。

Let’s start with the most common form of machine learning, one we have already encountered, called supervised learning. We are given some labeled data, X. Each instance of X is a d-dimensional vector, meaning it has d components. So, X is a matrix, where each row of the matrix is one instance of the data.

[x1, x2, x3, x4,…, xd]

[x1, x2, x3, x4,…, xd]

假设每个 X 实例都代表一个人。而分量 [x1, x2, x3,…, xd] 可以表示该人的身高、体重、体质、胆固醇水平、血压等值。每个 X 实例都与一个标签 y 相关联。假设如果该人在测量生理参数后的五年内没有心脏病发作,则 y 为 -1,如果发作了,则为 1。在监督学习中,某个算法(例如感知器)会获得训练数据:一组代表 n 个人的数据点(因此 X 是一个 n x d 矩阵,n 行,d 列);X 的每一行都有一个对应的标签 y,等于 -1 或 1。y 的所有值放在一起构成向量 y。算法使用这些数据来了解 X 和 y 的底层分布。

Each instance of X could represent, say, a person. And the components [x1, x2, x3,…, xd] could be values for the person’s height, weight, body mass, cholesterol levels, blood pressure, and so on. Associated with each instance of X is a label, y. Let’s say y is -1 if the person did not have a heart attack in the five years following the date their physiological parameters were measured, and 1 if they did. In supervised learning, some algorithm—the perceptron, for example—is given the training data: a set of data points representing n people (so X is an n x d matrix, n rows, d columns); each row of X has a corresponding label y, equal to -1 or 1. All values of y taken together form the vector y. The algorithm uses these data to learn something about the underlying distribution of X and y.

假设 P ( X, y ) 是基础概率分布,该分布捕获了我们对所有人类在当前生理状态下未来五年内心脏病发作风险的了解。当我们创建一个包含多个个体的数据集时,假设这些个体是随机选择的,那么该数据集就类似于从这个基础分布中提取或采样。给定一个未标记的新个体的数据,ML 算法必须预测此人在未来五年内是否有心脏病发作的风险。现在,如果您知道基础分布,您就可以非常简单地找出在 x 的情况下此人有风险的概率,以及在x的情况下此人没有风险的概率(其中x指的是单个人或X的一个实例的向量)。

Let’s say that P (X, y) is the underlying probability distribution, which captures our knowledge about all humans as to their risk of having a heart attack in the next five years, given a person’s current physiological state. When we create a dataset of several individuals, assuming that these individuals are randomly chosen, the dataset is akin to drawing from, or sampling, this underlying distribution. The ML algorithm, given data about a new unlabeled individual, must predict whether that person is at risk of a heart attack in the next five years. Now, if you knew the underlying distribution, you could very simply figure out the probability that the person was at risk given x and the probability that the person was not at risk given x (where x refers to the vector for a single person or an instance of X).

P (y = 有风险 | x) 和 P (y = 无风险 | x)

P (y = at-risk | x) and P (y = not-at-risk | x)

然后,进行预测的一种方法是选择概率较高的类别。本章后面的部分将介绍如何做到这一点(这需要使用贝叶斯定理),但现在,我们只需要认识到,这是 ML 算法所能做到的最好,因为它可以访问底层分布。这样的分类器称为贝叶斯最优分类器。

Then, one way to make a prediction would be to choose the category that had the higher probability. Later in the chapter, we’ll come to just how you can do this (it involves using Bayes’s theorem), but for now, all we need to appreciate is that this is the best an ML algorithm can do, because it has access to the underlying distribution. Such a classifier is called a Bayes optimal classifier.

但几乎在所有情况下,都不可能知道底层分布。因此可以说,概率性 ML 算法的任务归结为从数据中估计分布。有些算法比其他算法做得更好,但都会犯错误。因此,当你听到有人说人工智能正在做出准确的预测时,请记住,它几乎不可能 100% 准确。无论算法是隐式概率性的(如感知器的情况)还是显式概率性的(稍后我们会看到这些例子),它都可能出错。然而,这并不是对机器学习的否定。作为人类,虽然我们似乎认为自己做出的是理性、无错误的决定,但我们也在做概率性的决定。只是这些概率过程发生在幕后,我们大多对此毫无察觉。

But in just about every case, it’s impossible to know the underlying distribution. So, the task of probabilistic ML algorithms, one can say, comes down to estimating the distribution from data. Some algorithms do it better than others, and all make mistakes. So, when you hear claims that an AI is making accurate predictions, remember that it’s well-nigh impossible to be 100 percent accurate. Whether an algorithm is implicitly (as in the case of the perceptron) or explicitly probabilistic (we’ll see examples of these in a bit), it can be wrong. And yet, this is not a knock against machine learning. As humans, while we seem to think we make rational, error-free decisions, we also make probabilistic decisions. It’s just that these probabilistic goings-on happen under the hood, so to speak, and we are mostly unaware of them.

估计底层分布并非易事。首先,对分布的形状做出一些简化的假设通常更容易。它是伯努利分布吗?它是正态分布吗?请记住,这些理想化的分布描述仅仅是理想化的;它们使数学变得更容易,但不能保证底层分布完全符合这些数学形式。此外,当我们有数据时,我们掌握的是比率。例如,一枚有偏差的硬币正面朝上的次数。我们必须将比率转化为概率:两者并不相同,将它们等同起来可能会有问题;正如斯塔克告诉我的那样,“这是一个巨大的认识论飞跃。”尽管如此,这一步我们还是会谨慎地迈出。

Estimating underlying distributions is not trivial. For starters, it’s often easier to make some simplifying assumptions about the shape of the distribution. Is it a Bernoulli distribution? Is it a normal distribution? Keep in mind that these idealized descriptions of distributions are just that: idealized; they make the math easier, but there’s no guarantee that the underlying distribution hews exactly to these mathematical forms. Also, when we have data, we have access to rates. For example, how many times a biased coin came up heads. We must turn rates into probabilities: The two are not the same, and equating them can be problematic; as Stark told me, “That’s a huge epistemic jump.” Nonetheless, it’s one we make with caution.

假设您确实假设了底层分布的类型。每种分布都由一些参数来表征。例如,我们看到伯努利分布的特征是值 p,据此您可以写下这个概率质量函数:

Let’s say you do assume the type of underlying distribution. In each case, the distribution is characterized by some parameters. We saw that a Bernoulli distribution, for example, is characterized by the value p, which then allows you to write down this probability mass function:

P (X = x) = p^x (1 - p)^(1-x), for x = 0 or 1

正态分布的特征是平均值和标准差。还有其他类型的分布。有些有自己的一组参数,用希腊字母θ (theta) 表示。(我们暂时忽略所谓的非参数分布,它们不是用一些参数集指定。)然后底层分布写为:

A normal distribution is characterized by the mean and standard deviation. There are other types of distributions. Some have their own set of parameters, denoted by the Greek letter θ (theta). (Let’s ignore for now so-called nonparametric distributions, which are not specified with some set of parameters.) The underlying distribution is then written as:

X,y

Pθ(X, y)

或者我们可以用单个字母“ D ”(代表“数据”)来指代Xy,并按如下方式写出分布:

Or we can refer to both X and y with a single letter “D” (for “data”), and write the distribution in this way:

Pθ(D) 或 P(D; θ)

Pθ(D) or P(D; θ)

所有这些都引出了关于 ML 算法的广泛论述。我们可以重点关注两种重要方法。(还有其他方法,但这两种方法让我们直面一些有趣的问题。)

All this is leading up to a broad statement about ML algorithms. We can focus on two important approaches. (There are others, but these two get us to the heart of some interesting issues.)

  • 在第一种方法中,给定数据,ML 算法会针对某些分布类型(伯努利分布、高斯分布或其他分布)计算出最佳的θ ,从而最大程度地提高看到数据D的可能性。换句话说,您正在估计具有参数θ的最佳底层分布,这样如果您从该分布中抽样,您将最大程度地提高观察到您已经掌握的标记数据的可能性。毫不奇怪,这种方法被称为最大似然估计 (MLE)。它最大化 P ( D | θ ),即在给定θ 的情况下观察到D的概率,并且与频率学派方法论松散地相关。
  • In the first method, given data, the ML algorithm figures out the best θ, for some choice of distribution type (Bernoulli or Gaussian or something else), which maximizes the likelihood of seeing the data, D. In other words, you are estimating the best underlying distribution, with parameter θ, such that if you were to sample from that distribution, you would maximize the likelihood of observing the labeled data you already had in hand. Not surprisingly, this method is called maximum likelihood estimation (MLE). It maximizes P (D | θ), the probability of observing D given θ, and is loosely associated with frequentist methodology.

举一个具体的例子,我们以两组人为例,一组人高,一组人矮。我们从每组中抽取了几百个身高样本。我们的工作是通过最大化 P( D | θ ) 来估计底层分布 P( D ) 。假设每组身高(高矮)都建模为高斯分布,具有自己的均值和方差,整体分布将是这两个高斯分布的组合。MLE 不假设哪些高斯分布比其他高斯分布更有可能,这意味着所有参数值的可能性都相同。有了这个假设,或者没有这个假设,MLE 会最大化 P ( D | θ ),其中θ指的是均值和方差。同样,通过将θ的所有值视为同样可能,MLE 会给我们一个θ,使观察我们收集的数据的可能性最大化。

As a concrete example, let’s take two populations of people, one tall and the other short. We have a few hundred samples of the heights of people from each group. Our job is to estimate the underlying distribution P (D) by maximizing P (D | θ). Let’s say that each set of heights (short and tall) is modeled as a Gaussian distribution, with its own mean and variance, and the overall distribution will be a combination of these two Gaussians. MLE makes no assumptions about which Gaussians are more likely than others, meaning that all values for the parameters are equally likely. Given this assumption, or lack of one, MLE maximizes P (D | θ), where θ here refers to the mean and variance. Again, by treating all values of θ as equally likely, MLE gives us the θ that maximizes the likelihood of observing the data we collected.
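A minimal sketch of MLE in code. The text’s tall-short example involves a mixture of two Gaussians, which needs more machinery to fit, so this toy fits just one population, with invented heights. For a single Gaussian, the likelihood-maximizing parameters have a closed form: the sample mean and the sample standard deviation.

```python
import math
import random

# Toy MLE for a single Gaussian: generate heights from a known
# distribution, then recover the parameters that maximize the likelihood.
# The "true" mean and SD are invented numbers for illustration.
random.seed(1)
true_mean, true_sd = 70.0, 3.0                 # hypothetical heights, in inches
data = [random.gauss(true_mean, true_sd) for _ in range(500)]

def log_likelihood(mean, sd):
    # Log of the product of Gaussian densities over the dataset.
    return sum(-0.5 * math.log(2 * math.pi * sd * sd)
               - (x - mean) ** 2 / (2 * sd * sd) for x in data)

# Closed-form MLE for a Gaussian: sample mean and sample SD.
mle_mean = sum(data) / len(data)
mle_sd = math.sqrt(sum((x - mle_mean) ** 2 for x in data) / len(data))

# Perturbing the mean away from the MLE lowers the likelihood of the data.
assert log_likelihood(mle_mean, mle_sd) > log_likelihood(mle_mean + 0.5, mle_sd)
```

With 500 samples, the recovered mean and SD land close to the values used to generate the data, which is the sense in which MLE estimates the underlying distribution.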

  • 在第二种方法中,给定采样数据,ML 算法最大化 P( θ | D ),即,给定数据,找到最可能的θ。正如康奈尔大学计算机科学教授 Kilian Weinberger 在一次 ML 讲座上雄辩地告诉学生的那样,这句话中隐藏着一些让频率论者抓狂的东西。找到最可能的θ的想法意味着θ本身遵循分布,也就是它被视为一个随机变量。现在,第二种方法在没有看到数据的情况下对哪个θ最有可能做出假设。这就是先验概率分布。贝叶斯统计学家认为,对θ的值有一个先验信念是完全合理的。
  • In the second method, given the sampled data, the ML algorithm maximizes P (θ | D), i.e., it finds the most likely θ, given the data. Hidden in that statement is something that causes frequentists to tear their hair out, as Kilian Weinberger, professor of computer science at Cornell University, eloquently told his students in one of his ML lectures. The idea that you are going to find the most likely θ implies that θ itself follows a distribution, meaning it is being treated as a random variable. This second method now makes assumptions about which θ is most likely, without having seen the data. This is the prior probability distribution. Bayesian statisticians argue that it’s entirely reasonable to have a prior belief for the value of θ.

如果我们重新审视我们的高矮数据集,这个论点将变得更加清晰。当然,并非所有高斯分布都具有同等可能性。基于我们对世界的了解,我们可以合理地假设,矮人的高斯分布的平均值是 5 英尺,而高人的高斯分布的平均值是 6 英尺或类似的数字。(你明白了。)我们可以对它们的方差做出类似的假设。

The argument will become clearer if we revisit our tall-short dataset. Surely, not all Gaussian distributions are equally likely. We can, based on our knowledge about the world, make a fair assumption that the Gaussian for short people has a mean of, say, five feet, while the Gaussian for tall people has a mean of six feet or something along those lines. (You get the picture.) And we can make similar assumptions about their variances.

有了先验分布和采样数据,我们就可以估计后验分布,即给定数据时最可能的 θ。第二种方法称为 MAP,即“最大后验”(maximum a posteriori)估计。这是一种贝叶斯方法,但不是唯一的贝叶斯方法。

With this prior distribution in hand and the sampled data, one can estimate the posterior distribution, which is the most likely θ, given the data. This second method is called MAP, for “maximum a posteriori” estimation. It’s a Bayesian approach, though not the only one.
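A hedged sketch of the MAP idea for the simplest tractable case: estimating a Gaussian’s mean, with a Gaussian prior belief about that mean and an assumed known data spread. All the numbers below are invented for illustration; the point is that the posterior mode blends the prior mean with the sample mean, weighted by their precisions (inverse variances).

```python
# MAP estimate of a Gaussian's mean, with a Gaussian prior on that mean.
# With this conjugate prior, the posterior mode has a closed form:
# a precision-weighted average of the prior mean and the sample mean.
prior_mean, prior_sd = 60.0, 2.0     # invented prior belief: short people ~5 ft
data = [59.0, 61.5, 60.5, 62.0, 58.5]  # invented height measurements, inches
data_sd = 3.0                        # assumed known measurement spread

n = len(data)
sample_mean = sum(data) / n          # this is also the MLE

prior_prec = 1 / prior_sd ** 2       # precision of the prior
data_prec = n / data_sd ** 2         # precision contributed by the data
map_mean = (prior_prec * prior_mean + data_prec * sample_mean) / (prior_prec + data_prec)

print(round(map_mean, 3))            # lands between prior_mean and sample_mean
```

Note how the MAP estimate is pulled toward the prior when data are scarce; as n grows, the data precision dominates and MAP approaches the MLE, which is the convergence mentioned a little later in the text.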

在 MLE 和 MAP 中,实际的数学运算可能很复杂,但从概念上讲,它涉及以下基本步骤:

In both MLE and MAP, the actual math can get sophisticated, but conceptually, it involves these basic steps:

  • 写下需要最大化的函数,同时纳入任何必要的假设。
  • Write down the function that needs to be maximized, while incorporating any necessary assumptions.
  • 求该函数的导数。在 MLE 和 MAP 中,导数都是相对于参数 θ 求的。现在将导数设为零。(这是函数没有斜率的地方;它处于最大值。当然,导数在最小值处也可以为零,并且有方法可以检查。)然后求解将导数设为零所得到的方程。
  • Take the derivative of the function. In both MLE and MAP, the derivative is taken with respect to the parameters, θ. Now set the derivative to zero. (This is the place where the function has no slope; it’s at a maximum. Of course, the derivative can be zero at a minimum, too, and there are ways to check.) Now solve the equation(s) you get by setting the derivative to zero.
  • 有时(事实上,大多数时候)这个问题没有闭式解。在这种情况下,你不是试图找到最大值,而是取函数的负数并尝试找到最小值。我们已经知道如何找到最小值,至少对于凸函数来说是这样:你做梯度下降。这会给你相关的结果。
  • Sometimes—in fact, most times—there’s no closed-form solution to the problem. In this case, instead of trying to find the maximum, you take the negative of the function and try to find the minimum. We already know how to find the minimum, at least for convex functions: You do gradient descent. That’ll give you the relevant results.
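The steps above can be sketched in code. Here gradient descent is run on the negative log-likelihood of a Gaussian’s mean, a case that does have a closed-form answer (the sample mean), so the descent’s result can be checked against it. The data values, starting guess, and learning rate are invented:

```python
# Gradient descent on the negative log-likelihood (NLL) of a Gaussian mean.
# NLL = sum(0.5*log(2*pi*sd^2) + (x - mean)^2 / (2*sd^2)), so
# d(NLL)/d(mean) = sum((mean - x) / sd^2).
data = [98.1, 98.6, 97.9, 98.4, 98.3, 98.7]  # invented temperatures, F
sd = 0.73                                    # treat the spread as known

def nll_gradient(mean):
    return sum((mean - x) / sd ** 2 for x in data)

mean = 95.0          # arbitrary starting guess
lr = 0.01            # learning rate
for _ in range(2000):
    mean -= lr * nll_gradient(mean)          # step downhill on the NLL

sample_mean = sum(data) / len(data)          # the closed-form MLE
print(round(mean, 4), round(sample_mean, 4))  # the two agree
```

Minimizing the negative of the likelihood is the same as maximizing the likelihood, which is why gradient descent, already in our toolkit, does the job when no closed form exists.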

当您有大量采样数据时,MLE 非常有效,而 MAP 则在数据较少时效果最佳。随着采样数据量的增加,MAP 和 MLE 在估计底层分布时开始趋于一致。

MLE is powerful when you have a lot of sampled data, while MAP works best with fewer data. And as the amount of sampled data grows, MAP and MLE begin converging in their estimate of the underlying distribution.

我们大多数人在直觉上都是频率论者。但贝叶斯统计方法非常强大。(注意:贝叶斯统计与贝叶斯定理不是一回事。即使是频率论者也重视贝叶斯定理。他们只是反对在试图从数据中辨别某个分布的属性时,对该分布的参数持有先验信念这一整个想法。)

Most of us are intuitively frequentists. But the Bayesian approach to statistics is extremely powerful. (Note: Bayesian statistics is not the same as Bayes’s theorem. Even frequentists value Bayes’s theorem. They just object to this whole idea of having prior beliefs about the parameters of a distribution when trying to discern the properties of that very distribution from data.)

首次大规模使用贝叶斯推理进行机器学习的演示之一是由两位统计学家 Frederick Mosteller 和 David Wallace 进行的,他们利用这项技术解决了几个世纪以来一直困扰历史学家的问题:有争议的《联邦党人文集》的作者身份。

One of the first large-scale demonstrations of using Bayesian reasoning for machine learning was due to two statisticians, Frederick Mosteller and David Wallace, who used the technique to figure out something that had been bothering historians for centuries: the authorship of the disputed Federalist Papers.

这些论文是谁写的?

WHO WROTE THEM PAPERS?

1787 年夏天美国宪法在费城起草数月后,一系列以笔名“普布利乌斯”(Publius)匿名发表的文章开始出现在纽约州的报纸上。这样的文章共发表了七十七篇,旨在说服纽约人批准宪法。这些文章加上另外八篇,总共八十五篇,随后出版成两卷集,名为《联邦党人文集:为支持新宪法而撰写的论文集,经 1787 年 9 月 17 日联邦会议同意》。最终,人们得知这些论文的作者是美国三位“开国元勋”:亚历山大·汉密尔顿、约翰·杰伊和詹姆斯·麦迪逊。大约二十年后,在汉密尔顿去世(死于他与时任美国副总统阿伦·伯尔之间的决斗)之后,这些论文开始被归于各自的作者。其中七十篇论文的作者是已知的。但在剩余的论文中,有十二篇被认为是汉密尔顿或麦迪逊所写,另有三篇被认为是合著的。

Months after the U.S. Constitution was drafted in Philadelphia in the summer of 1787, a series of essays, published anonymously under the pen name “Publius,” began appearing in newspapers in New York State. Seventy-seven such essays were published, written to convince New Yorkers to ratify the Constitution. These essays, plus eight more, for a total of eighty-five, were then published in a two-volume set titled The Federalist: A Collection of Essays, Written in Favour of the New Constitution, as Agreed upon by the Federal Convention, September 17, 1787. Eventually, it became known that the essays had been written by Alexander Hamilton, John Jay, and James Madison, three of the “founding fathers” of the United States. About two decades later, and after Hamilton had died (following a fatal duel between him and Aaron Burr, the then-U.S. vice president), the essays began to be assigned to individual authors. For seventy of the papers, the writers were known. But of the remaining papers, twelve were thought to have been written by either Hamilton or Madison, and three were thought to have been co-authored.

您可能会认为,仍然活着的麦迪逊会清楚地识别出每篇论文的作者。但正如弗雷德里克·莫斯特勒 (Frederick Mosteller) 在《统计的乐趣》中所写, “争议存在的主要原因是麦迪逊和汉密尔顿并不急于提出他们的主张。在撰写这些文章后的几年内,他们成为了政治上的死敌,有时他们还会反对自己撰写的联邦党人著作。”他们的行为就像为客户撰写简报的律师一样,莫斯特勒写道:“他们不需要相信或支持他们提出的支持新宪法的每一个论点。”因此,这 15 份文件的作者身份仍未确定。

You’d think that Madison, who was still alive, would have clearly identified the authors of each paper. But as Frederick Mosteller writes in The Pleasures of Statistics, “the primary reason the dispute existed is that Madison and Hamilton did not hurry to enter their claims. Within a few years after writing the essays, they had become bitter political enemies and each occasionally took positions opposing some of his own Federalist writings.” They behaved like lawyers writing briefs for clients, Mosteller writes: “They did not need to believe or endorse every argument they put forward favoring the new Constitution.” Consequently, the authorship of these fifteen documents remained unresolved.

1941 年,莫斯特勒和一位名叫弗雷德里克·威廉姆斯的政治学家决定解决这个问题。他们研究了麦迪逊和汉密尔顿在作者身份没有争议的论文中使用的句子长度。他们的想法是找出每个作者独特的“签名”——也许一个作者使用的句子比另一个作者更长——然后使用这些签名来检查有争议的论文的句子长度,从而确定它们的作者。但这项努力毫无进展。“当我们汇总已知论文的结果时,汉密尔顿和麦迪逊的平均长度分别为 34.55 和 34.59——这完全是一场灾难,因为这些平均值实际上是相同的,因此无法区分作者。”

In 1941, Mosteller and a political scientist named Frederick Williams decided to tackle the problem. They looked at the lengths of sentences used by Madison and Hamilton in the papers whose authorship was not in dispute. The idea was to identify each author’s unique “signatures”—maybe one author used longer sentences than the other—and then use those signatures to check the sentence lengths of the disputed papers and, hence, their authorship. But the effort led nowhere. “When we assembled the results for the known papers the average lengths for Hamilton and Madison were 34.55 and 34.59, respectively—a complete disaster because these averages are practically identical and so could not distinguish authors.”

莫斯特勒和威廉姆斯还计算了标准差 (SD),这可以衡量句子长度的分布。同样,这两个数字非常接近。汉密尔顿的标准差为 19,麦迪逊的标准差为 20。如果要绘制每个作者的句子长度的正态分布,这两条曲线将大量重叠,几乎没有区分能力。这项工作成为了教学时刻。莫斯特勒在哈佛大学授课时,利用对《联邦党人文集》的分析来教育他的学生应用统计方法的困难。

Mosteller and Williams also calculated the standard deviation (SD), which provided a measure of the spread of the sentence lengths. Again, the numbers were very close. The SD for Hamilton was 19, and 20 for Madison. If you were to draw the normal distribution of sentence lengths for each author, the two curves would overlap substantially, providing little discriminatory power. This work became a teaching moment. Mosteller, while lecturing at Harvard, used this analysis of The Federalist Papers to educate his students on the difficulties of applying statistical methods.

到了 20 世纪 50 年代中期,莫斯特勒和芝加哥大学的统计学家戴维·华莱士开始考虑使用贝叶斯方法进行推理。当时,还没有将贝叶斯分析应用于大型实际问题的例子。

By the mid-1950s, Mosteller and statistician David Wallace, who was at the University of Chicago, began wondering about using Bayesian methods for making inferences. At the time, there were no examples of applying Bayesian analysis to large, practical problems.

就在那时,莫斯特勒收到了历史学家道格拉斯·阿代尔 (Douglass Adair) 的一封信,后者注意到了莫斯特勒在哈佛教授的课程。阿代尔希望莫斯特勒重新审视《联邦党人文集》的作者问题。“[阿代尔]……受到启发,写信建议我(或者更广泛地说,统计学家)应该重新研究这个问题。他指出,词语可能是关键,因为他注意到汉密尔顿几乎总是使用‘while’的形式,而麦迪逊则使用‘whilst’的形式。唯一的问题是,许多论文两种形式都没有出现,”莫斯特勒写道。“我们被激励着行动起来。”没有截止期限。“没错,阿代尔急于知道答案,但历史善于等待。”

It was about then that Mosteller received a letter from the historian Douglass Adair, who had become aware of the courses being taught by Mosteller at Harvard. Adair wanted Mosteller to revisit the issue of the authorship of The Federalist Papers. “[Adair]…was stimulated to write suggesting that I (or more generally, statisticians) should get back to this problem. He pointed out that words might be the key, because he had noticed that Hamilton nearly always used the form ‘while’ and Madison the form ‘whilst.’ The only trouble was that many papers contained neither of them,” Mosteller writes. “We were spurred to action.” There was no deadline. “True, Adair was in a hurry to know, but history is good at waiting.”

他们其中一个有成效的想法是研究所谓的功能词,即具有功能而非意义的词,例如介词、连词和冠词。首先,他们必须统计汉密尔顿和麦迪逊撰写的文件中出现这些词的次数。这是一个费力的过程。在其他人的帮助下,两人开始在一条长纸带上输入每篇文章中的每个单词,每行一个单词。然后,他们开始更费力的任务,将纸带剪成纸条,每张纸条上只有一个单词,然后按字母顺序排列纸条。“那是在 1959-60 年;现在看来,这一切都很原始,甚至可笑,”莫斯特勒写道。“当计数进行时,如果有人打开门,纸条就会在房间里飞来飞去。”

One of their ideas that bore fruit was to look at so-called function words, words that have a function rather than a meaning—prepositions, conjunctions, and articles. First, they had to count the occurrence of such words in documents written by Hamilton and Madison. It was a laborious process. With the help of others, the duo began typing each word from each article, one word per line, on a long paper tape. Then they began the even more laborious task of cutting the paper tape into slips, such that each slip had just one word on it, and then arranging the slips in alphabetical order. “That was in 1959–60; it all seems primitive, even laughable now,” Mosteller writes. “When the counting was going on, if someone opened a door, slips of paper would fly about the room.”

最终他们想出了如何使用计算机来计算单词并按字母顺序排列。但计算机也有自己的怪癖。“在 3000 字左右的某个不确定的点之前,程序做得非常好,然后它就会发疯,毁掉它迄今为止所做的一切,”莫斯特勒写道,并补充说,“不管这些政治文章多么重要,《联邦党人文集》的 1500 字已是任何人(甚至是计算机)所能承受的极限。”就这样,他们一次处理几千字,直到完成了对汉密尔顿和麦迪逊撰写的大量文章中某些功能词的计数。

Eventually they figured out how to use a computer to count and alphabetize the words. But the computer came with its own idiosyncrasies. “The program did this beautifully up to some indeterminate point around 3000 words, and then it would go crazy, destroying everything it had done so far,” Mosteller writes, adding that “1500 words of The Federalist was as much as anyone (even a computer) could stand, however important these political writings may be.” And so it went, a few thousand words at a time, until they had the counts for certain function words that appeared in a large number of articles written by Hamilton and Madison.

现在是时候确定其中一篇有争议的论文的作者身份了。他们使用贝叶斯分析来计算两个假设的概率:(1)作者是麦迪逊,(2)作者是汉密尔顿。如果假设 1 的概率较大,则作者更有可能是麦迪逊。否则,作者就是汉密尔顿。取一个功能词,比如“upon”,计算给定该词的假设 1 的概率和给定该词的假设 2 的概率,并适当地确定作者身份。当然,同时使用多个词会使分析更加清晰。

Now it was time to figure out the authorship of one of the disputed papers. They used Bayesian analysis to calculate the probability of two hypotheses: (1) the author is Madison, and (2) the author is Hamilton. If hypothesis 1 has a greater probability, the author is more likely to be Madison. Otherwise, it’s Hamilton. Take one function word, say, “upon,” and calculate the probability of hypothesis 1 given the word and hypothesis 2 given the word, and ascribe authorship appropriately. Of course, using multiple words at once makes the analysis sharper.

这里的关键见解是,给定麦迪逊的一堆已知文档,某些词(例如“upon”)的使用遵循分布。麦迪逊在某些文档中使用这个词较多,在其他文档中使用较少。汉密尔顿也是如此。正如我们在句子长度问题中看到的那样,如果这些分布相似,则不能用来区分作者。但如果它们不同,它们就具有区分能力。莫斯特勒雄辩地阐述了这一点:“两位作者的词频分布相差越大,该词的区分能力越强。这里,by 的区分能力优于to,而 to 又优于from。”

The key insight here is that given a bunch of known documents by Madison, the usage of some word, such as “upon,” follows a distribution. Madison used the word more in some documents, less so in others. The same can be said of Hamilton. As we saw in the issue with sentence length, if these distributions are alike, they cannot be used to tell the authors apart. But if they are different, they possess the power to discriminate. Mosteller makes this point eloquently: “The more widely the distributions of rates [of words] of the two authors are separated, the stronger the discriminating power of the word. Here, [the word] by discriminates better than [the word] to, which in turn is better than [the word] from.”
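A toy version of the function-word idea, with invented numbers rather than the actual Federalist data: model each author’s rate of a word such as “upon” (per thousand words) as a Poisson distribution, then apply Bayes’s theorem to the count observed in a disputed paper.

```python
import math

# Toy sketch of the Mosteller-Wallace approach. The rates, count, and
# prior below are all invented for illustration; the real study used
# many function words and more elaborate models.
rate_hamilton = 3.0   # hypothetical "upon" rate per 1,000 words
rate_madison = 0.2

def poisson_pmf(k, lam):
    # Probability of observing count k under a Poisson with rate lam.
    return lam ** k * math.exp(-lam) / math.factorial(k)

observed_count = 0    # the disputed paper barely uses "upon"
prior = 0.5           # no prior preference for either author

# Bayes's theorem: posterior proportional to likelihood times prior.
p_h = poisson_pmf(observed_count, rate_hamilton) * prior
p_m = poisson_pmf(observed_count, rate_madison) * prior
posterior_madison = p_m / (p_h + p_m)
print(round(posterior_madison, 3))   # a count of zero strongly favors Madison
```

The farther apart the two authors’ rate distributions are, the more a single count shifts the posterior, which is exactly the discriminating power Mosteller describes.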

随后,莫斯特勒和华莱士将这些来自词率的证据转化为适合进行统计推断的模型。他们尝试了各种模型;每个模型都隐含着关于底层概率分布的某些假设,使数学变得易于处理。结果是一致的。“无论使用何种方法,结果都一样:压倒性的证据支持麦迪逊是这些有争议论文的作者。我们的数据独立地补充了历史学家的证据。从置信程度的意义上说,麦迪逊极有可能撰写了这些有争议的联邦党人文集,唯一可能的例外是第 55 号论文,而即便对这一篇,我们的证据也给出了 80 比 1 支持麦迪逊的赔率——虽然强劲,但并非压倒性。”

Mosteller and Wallace then turned such evidence from word rates into an appropriate model to do statistical inference. They tried various models; each model implied certain assumptions about the underlying probability distribution that made the math tractable. The results were unanimous. “By whatever methods are used, the results are the same: overwhelming evidence for Madison’s authorship of the disputed papers. Our data independently supplement the evidence of the historians. Madison is extremely likely, in the sense of degree of belief, to have written the disputed Federalist papers, with the possible exception of paper number 55, and there our evidence yields odds of 80 to 1 for Madison—strong, but not overwhelming.”

宾夕法尼亚州匹兹堡杜肯大学计算机科学教授、当代文体学(利用写作风格变化的统计数据来确定作者身份)专家帕特里克·朱奥拉 (Patrick Juola) 表示,莫斯特勒和华莱士的工作对统计学家来说是一个开创性的时刻。“它在统计理论中非常有影响力。他们受到的赞扬是名副其实的,”朱奥拉告诉我。“历史学家们研究这个问题已有一百年。他们得出的结论大多与莫斯特勒和华莱士相同。而他们的研究之所以如此具有开创性,是因为这是第一次以完全客观的、算法化的方式完成,也就是说,这是机器学习[斜体为我所加]。”

Patrick Juola, professor of computer science at Duquesne University in Pittsburgh, Pennsylvania, and a modern-day expert in stylometry (the use of the statistics of variations in writing style to determine authorship), said that Mosteller and Wallace’s work was a seminal moment for statisticians. “It was very influential in statistical theory. And they were justifiably lauded,” Juola told me. “Historians had been looking at the problem for a hundred years. And the historians had mostly come to the same decisions that Mosteller and Wallace did. And what made [their] study so groundbreaking was [that] for the first time, this was done in a completely objective, algorithmic fashion, which is to say it was machine learning [italics mine].”

一群企鹅

A WADDLE OF PENGUINS

为了进一步了解机器学习的概率性质,让我们前往南极洲的帕尔默群岛。在那里,海洋生物学家克里斯汀·戈尔曼 (Kristen Gorman) 领导的团队精心收集了来自三个岛屿的 334 只企鹅的数据。没错,就是企鹅。三种企鹅:阿德利企鹅、巴布亚企鹅和帽带企鹅。每只企鹅的特征包括其物种名称、居住的岛屿、喙长(毫米)、喙深(毫米)、鳍长(毫米)、体重(克)、性别(雄性或雌性)以及被研究的年份。这些数据由戈尔曼和她的同事 Allison Horst 和 Alison Hill 免费提供。

To get further insights into the probabilistic nature of machine learning, let’s head to the Palmer Archipelago, Antarctica. There, a team led by marine biologist Kristen Gorman painstakingly collected data on 334 penguins from three islands. Yes, penguins. Three species of penguins: Adélie, Gentoo, and Chinstrap. Each penguin is characterized by its species name, the island where it lives, its bill length (in millimeters), its bill depth (mm), its flipper length (mm), its body mass (grams), its sex (male or female), and the year it was studied, data that were made freely available by Gorman and her colleagues Allison Horst and Alison Hill.

为了我们的目的,让我们忽略关于企鹅起源岛屿和研究年份的信息。如果没有这些数据,每只企鹅都有这五个属性:喙长、喙深、鳍长、体重和性别。这些属性可以被认为是向量 [ x1 , x2 , x3 , x4 , x5 ] 的分量,其中x1 = 喙长度,x2 = 喙深度,依此类推。

For our purposes, let’s disregard the information about the island of origin and the year the penguin was studied. Sans those data, each penguin has these five attributes: bill length, bill depth, flipper length, body mass, and sex. These attributes can be thought of as the components of a vector [x1, x2, x3, x4, x5], where x1 = bill length, x2 = bill depth, and so on.

因此,每只企鹅都用一个向量x = [ x1 , x2 , x3 , x4 , x5 ]来表示。每只企鹅都有一个标签y = 阿德利企鹅 (0)、巴布亚企鹅 (1) 或帽带企鹅 (2)。因此,企鹅是 5D 空间中的一个点,该点可以归类为属于三个物种之一。

So, each penguin is characterized by a vector x = [x1, x2, x3, x4, x5]. And each penguin has a label y = Adélie (0), Gentoo (1), or Chinstrap (2). A penguin, then, is a point in 5D space, and that point can be classified as belonging to one of three species.

机器学习算法的任务是学习企鹅的属性与这些属性所属物种之间的相关性。基本上,该算法需要学习将x映射到y 的函数。

The task for an ML algorithm is to learn the correlation among the attributes of a penguin and the species to which those attributes belong. Basically, the algorithm needs to learn a function that maps x to y.

f(x) = y

f(x) = y

然后,给定一个新的x,它可以得出y的预测值(0、1 或 2)。或者,我们可以说 ML 算法必须估计底层分布 P ( X , y )。然后,给定一些新数据x,它可以计算 P ( y = Adélie | x ) 和 P ( y = Gentoo | x ) 和 P ( y = Chinstrap | x )。考虑到新的未知企鹅的属性,三个条件概率中最大的一个为我们提供了最可能的物种。

Then, given a new x, it can spit out the predicted value for y (0, 1, or 2). Or, we can say that the ML algorithm must estimate the underlying distribution P (X, y). And then, given some new data x, it can calculate P (y = Adélie | x) and P (y = Gentoo | x) and P (y = Chinstrap | x). The largest of the three conditional probabilities gives us the most likely species, given the new, unknown penguin’s attributes.
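A sketch of choosing the largest conditional probability, using just one feature (bill length) and assuming Gaussian class-conditional distributions. The means, standard deviations, and priors below are made-up stand-ins for the Palmer data, not estimates from it:

```python
import math

# Pick the most probable species via Bayes's theorem on one feature.
# All parameters are invented placeholders, not fitted to the real data.
classes = {
    "Adelie":    (38.8, 2.7),   # (mean bill length in mm, SD)
    "Gentoo":    (47.5, 3.1),
    "Chinstrap": (48.8, 3.3),
}
priors = {"Adelie": 0.44, "Gentoo": 0.36, "Chinstrap": 0.20}

def normal_pdf(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def classify(bill_length):
    # P(y | x) is proportional to P(x | y) * P(y); the shared
    # denominator P(x) doesn't change which class wins the argmax.
    scores = {species: normal_pdf(bill_length, m, s) * priors[species]
              for species, (m, s) in classes.items()}
    return max(scores, key=scores.get)

print(classify(39.0))   # a short bill points to Adelie under these parameters
```

With the real five attributes the same recipe applies, only with class-conditional distributions over a 5D vector instead of a single number.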

暂且先以两种物种为例。如果 5D 数据是线性可分的(即可以绘制一个 4D 超平面,将阿德利企鹅的数据与 5D 坐标空间中的帽带企鹅的数据清晰地分开),那么我们就可以使用感知器算法来找到该超平面。然后,给定一只新的、尚未分类的企鹅的数据,感知器可以告诉我们企鹅是落在超平面的一侧还是另一侧,并据此对其进行分类。但我们现在知道,感知器算法会找到无限多个可能的超平面之一。感知器对新数据进行分类的能力可能容易出错。阿德利企鹅可能会被归类为帽带企鹅,反之亦然。

For a moment, let’s take just two species. If the 5D data are linearly separable—meaning one can draw a 4D hyperplane that cleanly separates, say, the data representing Adélie penguins from those of the Chinstrap penguins in 5D coordinate space—then we can use a perceptron algorithm to find that hyperplane. Then, given data about a new, as-yet-unclassified penguin, the perceptron can tell us whether the penguin falls on one side of the hyperplane or the other and classify it accordingly. But we know by now that the perceptron algorithm finds one of the infinitely many possible hyperplanes. The perceptron’s ability to classify new data can be error-prone. An Adélie could get classified as a Chinstrap, and vice versa.

但我们面临着一个更大的问题:数据线性可分这一假设。首先,即使代表几百只企鹅的数据是线性可分的,如果你继续收集越来越多的数据,也不能保证这种区分会一直成立。但即使只有少量企鹅样本,就像这里的情况,也很明显这个假设是可疑的。我们可以亲眼看看。

But there’s a bigger problem staring at us: the assumption of linearly separable data. For starters, even if the data representing a few hundred penguins were linearly separable, there’s no guarantee the distinction would hold if you kept collecting more and more data. But even with a small sampling of penguins, as is the case here, it’s clear that the assumption is suspect. We can see it for ourselves.

二维以上的数据很难可视化,因此让我们仅使用两个特征(例如喙长和喙深)以不同的方式查看数据,以了解问题所在。首先,这是两种企鹅(阿德利企鹅和巴布亚企鹅)的图,它们以这两个属性为特征。在这两个维度上,这两个物种显示出分离,但只是大致如此。一只巴布亚企鹅的属性值与阿德利企鹅非常相似;如果不是因为一个异常值,分离会更明显。

It’s hard to visualize data in anything more than two dimensions, so let’s look at the data in different ways using only two features—say, bill length and depth—to get a sense of the problem. First, here’s a plot of two species of penguins, Adélie and Gentoo, characterized by those two attributes. In these two dimensions, the two species show a separation, but only just about. One Gentoo penguin has attribute values that are very much like those of the Adélie penguins; if it weren’t for the one outlier, the separation would be cleaner.

为了让情况更复杂一些,让我们在同一张图表上绘制所有三个物种,代表相同的两个属性(见对面的图)。

To mix things up, let’s plot all three species on the same graph, for the same two attributes (see figure opposite).

现在情况更加混乱了。如果我们要建立一个机器学习模型来将这三种企鹅区分开来,结果将不会一目了然。即使企鹅样本有限,数据也会重叠,尤其是当我们比较阿德利企鹅和帽带企鹅或帽带企鹅和巴布亚企鹅时。

The picture is much more muddled now. If we had to build an ML model that could cleanly separate the three species of penguins from one another, it’s not going to be clear-cut. Even with this limited sampling of the penguins, the data overlap, particularly when we compare the Adélie and Chinstrap penguins or the Chinstrap and Gentoo penguins.

因此,我们构建的任何分类器,只要它学习函数f ( x ),将企鹅的特征映射到其物种,就总会犯一些错误。或者,换个角度来看,如果给分类器提供了关于企鹅的新数据,并且分类器预测企鹅是阿德利企鹅,那么这个预测就有一定概率是错误的。我们的任务是构建一个分类器,使错误的可能性最小化。

So, any classifier we build that learns the function f(x) to map the features of a penguin to its species will always make some mistakes. Or, looking at it another way, if the classifier is given new data about a penguin, and the classifier predicts that the penguin is, say, an Adélie, then that prediction will have a certain probability of being wrong. Our task is to build a classifier that minimizes the likelihood of a mistake.

这就是机器学习领域(称为贝叶斯决策理论)的用武之地。它为我们根据手头的数据所能做到的最好事情设定了界限。但首先,为了让事情更容易可视化和掌握,让我们将数据缩减为只有一个属性。

This is where an entire field of machine learning, called Bayesian decision theory, comes in. It establishes the bounds for the best we can do given the data at hand. But first, to make things even easier to visualize and grasp, let’s whittle the data down to just one attribute.

请考虑以下直方图,其中显示了阿德利企鹅的数量与其喙深度的关系(见下文)。

Consider the following histogram showing the numbers of Adélie penguins plotted against their bill depth (see below).

喙深度值有 10 个箱体,每个箱体中都包含一些阿德利企鹅,它们的喙深度属于该箱体。该图可让您大致了解阿德利企鹅喙深度的分布情况。

There are ten bins for the value of bill depth, and in each bin, there are some Adélie penguins whose bill depths belong in that bin. The plot gives you a rough sense of the distribution of bill depths of Adélie penguins.

在自然界中,分布可能是连续的。下图显示了一个正态钟形分布,其平均值和标准差与我们所拥有的数据相吻合。(请注意,y 轴现在显示的是“概率密度”,而不是“企鹅数量”——它为我们提供了一种方法,来了解阿德利企鹅的喙深度取某个值的可能性。)

In nature, the distribution would likely be continuous. The chart that follows shows a normal bell-shaped distribution with a mean and standard deviation fitted to the data we have. (Notice that the y-axis now reads “probability density,” rather than “number of penguins”—it’s giving us a way to get at the probability that the bill depth has some value in Adélie penguins.)

让我们想象一下,上面的平滑曲线是阿德利企鹅喙深度的实际潜在分布。如果我们可以访问该曲线,它将使我们能够计算出企鹅是阿德利企鹅时喙深度具有某些值的概率。(在这里,我在谈论阿德利企鹅具有特定喙深度的概率时有点不精确;我们从之前的分析中知道,当分布是连续的时,我们只能谈论喙深度的某个范围的概率。)因此,我们计算所谓的类条件概率,即基于企鹅属于某个特定类别(在本例中为阿德利企鹅)这一事实的概率。

Let’s imagine for a moment that the smooth curve above is the actual, underlying distribution of bill depths of Adélie penguins. If we had access to that curve, it would allow us to calculate the probability that the bill depth has some value, given the penguin is an Adélie. (I’m being a bit imprecise here, in talking about the probability of an Adélie penguin having a specific bill depth; we know from our earlier analysis that when the distribution is continuous, we can talk of the probability only for some range of bill depths.) So, we calculate something called a class-conditional probability, meaning, a probability conditioned on the fact that the penguin belongs to a particular class, in this case Adélie.

因此,分布给出了 P ( x | y = Adélie)。理想情况下,P 需要一个下标,表示它是喙深度的概率,但从上下文中可以清楚地看出,所以我将避开下标。

So, the distribution gives us P (x | y = Adélie). Ideally, the P needs a subscript, indicating that it’s the probability for bill depth, but it’s clear from the context, so I’m going to eschew the subscript.

我们可以对巴布亚企鹅做同样的事情。以下是采样数据的图表:

We could do the same for Gentoo penguins. Here’s a plot of the sampled data:

我们可以对数据进行正态分布拟合:

We can fit a normal distribution to the data:

再次假设曲线是自然界真实的、巴布亚企鹅喙深度的基本分布,并假设我们可以获得该分布,那么对于某个x,我们可以计算:

Again, assuming the curve is nature’s real, underlying distribution of bill depths of Gentoo penguins, and assuming we have access to that distribution, then we can calculate, for some x:

P ( x | y = Gentoo)

P (x | y = Gentoo)

以下是同一图中并排的两条曲线:

Here are the two curves in the same plot, side by side:

很明显,巴布亚企鹅的平均喙深度较小,而且喙深度在平均值附近分布得更为紧密,而阿德利企鹅的平均喙深度较大。从进行预测的角度来看,如果我们能够获得这些假定的潜在自然分布,然后我们得到一只身份不明的企鹅的喙深度——我们只知道它是巴布亚企鹅还是阿德利企鹅——那么我们可以计算出两个概率:

It’s clear that Gentoo penguins have a smaller mean bill depth and that the bill depths are somewhat more tightly clustered around the mean than they are for Adélie penguins, which have a larger mean bill depth. From the perspective of making predictions, if we had access to these supposed underlying natural distributions, and we were then given the bill depth for an unidentified penguin—all we were told is that it’s either a Gentoo or an Adélie—then we could calculate two probabilities:

P ( y = Gentoo | x ),表示给定喙深度的值时,该企鹅为 Gentoo 企鹅的概率;
P (y = Gentoo | x), which is the probability that the penguin is a Gentoo, given a value for the bill depth; and
P ( y = Adélie | x ),表示给定喙深度时,该企鹅是阿德利企鹅的概率。
P (y = Adélie | x), the probability that the penguin is an Adélie, given the bill depth.

如果前者较大,我们预测企鹅是巴布亚企鹅;如果后者值较大,我们预测它是阿德利企鹅。但是我们如何计算这两个概率呢?这就是贝叶斯定理的用武之地。

If the former is greater, then we predict that the penguin is a Gentoo; if the latter value is larger, then we predict that it’s an Adélie. But how do we calculate the two probabilities? This is where Bayes’s theorem comes in.

回想一下,就假设(H)和证据(E)而言:

Recall that, in terms of a hypothesis (H) and evidence (E):

P (H | E) = P (E | H) × P (H) / P (E)

我们有两个假设:(1) 这只企鹅是巴布亚企鹅,(2) 这只企鹅是阿德利企鹅。我们拥有的证据是喙的深度。

We have two hypotheses: (1) The penguin is a Gentoo, and (2) the penguin is an Adélie. The evidence we have is some bill depth.

因此,我们必须计算两个假设的概率:

So, we must calculate the probabilities of each of our two hypotheses:

P (y = Gentoo | x) = P (x | y = Gentoo) × P (y = Gentoo) / P (x)

P (y = Adélie | x) = P (x | y = Adélie) × P (y = Adélie) / P (x)

让我们解决第一个假设,看看如何得出它的概率。这涉及计算等式右侧每个项的值。

Let’s tackle the first hypothesis and see how we can arrive at the probability for it. It involves calculating the values for each term on the right-hand side of the equation.

P ( y = 巴布亚企鹅):这只是企鹅是巴布亚企鹅的“先验”概率。我们可以根据手头的数据估算出这个概率。在我们的企鹅样本中,有 119 只巴布亚企鹅和 146 只阿德利企鹅。因此,企鹅是巴布亚企鹅的先验概率估计为 119 / (119+146) = 0.45。

P (y = Gentoo): This is simply the “prior” probability that the penguin is a Gentoo. We can estimate this from the data at hand. In our sample of penguins, there were 119 Gentoo penguins and 146 Adélie penguins. So, an estimate of the prior probability that a penguin is a Gentoo is simply 119 / (119+146) = 0.45.

P ( x | y = Gentoo):我们可以从上面描绘的分布中读出这个值。在 x 轴上找到喙的深度,并在图中“Gentoo”部分的 y 轴上找出概率。

P (x | y = Gentoo): We can read off this value from the distribution depicted above. Find the bill depth on the x-axis and figure out the probability on the y-axis on the “Gentoo” part of the plot.

P ( x ):这是喙具有某个特定深度的概率。就像我们计算某人在接受疾病检测时结果呈阳性的概率(我们必须同时考虑真阳性和假阳性)一样,我们也可以考虑到我们面对的是两种类型的企鹅。所以:

P(x): This is the probability that the bill has some particular depth. In much the same way that we calculated the probability that someone could test positive when administered a test for a disease (where we had to account for both true positives and false positives), we can also take into account that we are dealing with two types of penguins. So:

P( x ) = P ( x | 阿德利企鹅) × P (阿德利企鹅) + P ( x | 巴布亚企鹅) × P (巴布亚企鹅)
P(x) = P (x | Adélie) × P (Adélie) + P (x | Gentoo) × P (Gentoo)

我们可以从分布中读出 P ( x | Adélie)。P (Adélie) 是企鹅是 Adélie 的先验概率。我们知道如何计算它。巴布亚企鹅也是如此。另外,请注意,P( x ) 对于两个假设都是相同的,因此在计算中经常被忽略,特别是在难以确定甚至无法确定的情况下。

We can read off P (x | Adélie) from our distribution. P (Adélie) is the prior probability that a penguin is an Adélie. We know how to calculate this. The same goes for Gentoo penguins. Also, note that P(x) is the same for both hypotheses and is therefore often ignored in the calculations, especially in situations where it can be tricky or even impossible to determine.

利用这些数据,我们可以计算出 P ( y = Gentoo | x ):给定某个喙深度x,该企鹅是 Gentoo 的“后验”概率。

Using these data, we can calculate P (y = Gentoo | x): the “posterior” probability that the penguin is a Gentoo, given some bill depth x.

我们可以对阿德利企鹅进行同样的分析,在给定相同喙深度x 的情况下,得到企鹅是阿德利企鹅的后验概率。然后,我们根据后验概率较高的那个来预测企鹅是阿德利企鹅还是巴布亚企鹅。

We can do the same analysis for Adélie penguins and get the posterior probability that the penguin is an Adélie, given the same bill depth x. Then, we predict that the penguin is an Adélie or a Gentoo based on whichever posterior probability is higher.

我们刚刚分析的这个简单分类器,只有一个企鹅特征(喙深度)和两种企鹅类型,被称为贝叶斯最优分类器。它是任何机器学习算法所能做到的最好的。在我们的分析中,结果取决于了解或估计数据的底层分布。

This simple classifier that we just analyzed, with only one feature of penguins (bill depth) and two types of penguins, is called the Bayes optimal classifier. It’s the best any ML algorithm can ever do. And in our analysis, the result was contingent upon knowing or estimating the underlying distribution of data.

但即使是贝叶斯最优分类器也会出错。假设你收到一只未分类的企鹅的数据,它的喙深约为 16 毫米。我们的计算可能会得出,例如,企鹅是巴布亚企鹅的概率为 0.8,是阿德利企鹅的概率为 0.2。因此,使用我们的算法,我们预测这只企鹅是巴布亚企鹅,但我们有 20% 的几率会出错。从数学上可以证明,这种错误无法通过任何其他技术来改善,因此它为预测风险设定了一个下限。

But even the Bayes optimal classifier makes errors. Let’s say you are handed data about an unclassified penguin, and it has a bill depth of about 16 mm. Our calculations might come up with, for example, a probability of 0.8 that the penguin is a Gentoo and 0.2 that it’s an Adélie. So, using our algorithm, we predict that the penguin is a Gentoo, but there’s a 20 percent chance we are wrong. It can be shown mathematically that this error cannot be improved upon by any other technique, so it sets a lower bound for the prediction risk.
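As a sketch of this whole procedure, the snippet below builds a one-feature Bayes classifier with assumed Gaussian class-conditional distributions. The means and standard deviations are hypothetical stand-ins (we never know nature's true curves); only the priors come from the sample counts in the text (146 Adélies, 119 Gentoos).

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical class-conditional Gaussians for bill depth (mm); the
# means and standard deviations are invented for illustration.
# The priors use the sample counts from the text: 146 Adélie, 119 Gentoo.
params = {"Adelie": (18.3, 1.2), "Gentoo": (15.0, 1.0)}
priors = {"Adelie": 146 / 265, "Gentoo": 119 / 265}

def posterior(x):
    """P(species | bill depth = x) via Bayes's theorem."""
    joint = {s: normal_pdf(x, mu, sd) * priors[s] for s, (mu, sd) in params.items()}
    evidence = sum(joint.values())        # P(x), the normalizing term
    return {s: v / evidence for s, v in joint.items()}

post = posterior(16.0)                    # a bill depth near the overlap
print(post)
```

With these assumed curves, a 16 mm bill comes out roughly 80/20 in favor of Gentoo: the classifier picks Gentoo, and the remaining posterior mass is exactly the irreducible chance of being wrong.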

如果这一切看起来太简单或微不足道,让我们开始泼冷水。首先,我们假设我们能够访问或估计底层的自然分布。当然,我们几乎从未访问过真正的底层分布——这就是我们首先进行机器学习的原因。正如我们之前所看到的,ML 算法基本上是试图从数据采样中获得底层分布的最佳近似值。有很多方法可以做到这一点,包括我们之前看到的两种:最大似然估计 (MLE) 和最大后验 (MAP) 估计。

If all this seems too easy, or trivial, let’s start pouring cold water. For starters, we assumed that we had access to or could estimate the underlying natural distributions. Of course, we almost never have access to the true underlying distribution—which is why we do machine learning in the first place. As we saw earlier, ML algorithms are basically trying to get at the best possible approximation of the underlying distribution from a sampling of data. There are any number of ways of doing this, including two we saw earlier: maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation.

此外,在我们分析的案例中,恰好喙深度可以帮助区分阿德利企鹅和巴布亚企鹅,而不会产生巨大的误差。但如果比较阿德利企鹅和帽带企鹅,情况就不一样了。这是两种企鹅物种的叠加图。如果仅使用喙深度进行评估,它们是无法区分的。即使是贝叶斯最优分类器也会在很多时候出错。

Also, it just so happens that in the case we analyzed, bill depth can help distinguish Adélie penguins from Gentoo penguins without huge errors being made. But that’s not the case if one compares Adélie and Chinstrap penguins. Here’s a plot of the two penguin species, overlaid. They are indistinguishable when evaluated using only bill depth. Even the Bayes optimal classifier will get things wrong a lot of the time.

这就是额外特征发挥作用的地方。如果我们将喙长度添加到组合中,并在 xy 平面上绘制两种类型的企鹅,我们将得到以下结果:

This is where extra features come in. If we were to add bill length to the mix, and plot the two types of penguins on the xy plane, here’s what we’d get:

圆圈(阿德利企鹅)和三角形(帽带企鹅)之间有一些重叠,但两者大多是不同的群体。如果我们可以估计底层分布,我们可以再次构建贝叶斯最优分类器。获得这些分布需要一些脑力劳动。就像我们拟合一个钟形曲线来表示喙深度的概率密度函数一样,我们可以拟合一个钟形曲面来表示喙长度和深度的 PDF。

There’s some overlap between the circles (Adélie) and triangles (Chinstrap), but the two are mostly distinct groups. We can again build a Bayes optimal classifier if we can estimate the underlying distributions. Getting to these distributions is going to take some mental calisthenics. Just as we fit a bell-shaped curve to represent the probability density function for bill depth, we can fit a bell-shaped surface to represent the PDF for bill length and depth taken together.

例如,上面就是这样的 3D 图。底部的 2D 表面显示了点分布的“热图”:靠近中心的较白区域会有更多的点,而向外移动,朝向更暗的区域,点会变得越来越少。3D 表面就是概率密度函数。

For example, above is such a 3D plot. The 2D surface at the bottom shows the “heat map” of the distribution of points: The whiter regions closer to the center will have more points, and the points will get scarcer as you move outward, toward darker regions. The 3D surface is the probability density function.

现在想象一下这两种企鹅喙长和喙深的图的这种表面。将有两个表面,一个以圆圈为中心,另一个以三角形为中心。这些表面的确切形状会略有不同。即使只是目测二维散点图,你也可以想象阿德利企鹅的表面会更圆更散,而帽带企鹅的表面会更窄更椭圆。这两个表面会重叠,就像我们只对喙深感兴趣时两条曲线重叠一样。

Now imagine such surfaces for our plot of bill length and depth for the two types of penguins. There would be two, one centered over the circles and another over the triangles. The precise shapes of these surfaces would be slightly different. Even by simply eyeballing the 2D scatter plot, you can imagine that the surface over the Adélie penguins would be more rounded and spread out, whereas the one over the Chinstrap penguins would be narrower and more ellipsoidal. Those two surfaces would overlap, just like the two curves did when we were interested only in bill depth.

如果我们能够估计这些基础分布,我们就能将一只新企鹅(根据其喙的长度和深度)归类为阿德利企鹅或帽带企鹅。同样,这种分类也容易出错,但这是我们能做到的最好了。

If we could estimate those underlying distributions, we’d be able to classify a new penguin (given its bill length and depth) as either an Adélie or a Chinstrap. And in much the same way, the classification would be error-prone, but it’s the best we can do.

这一切看起来一定很简单(其中一些简单之处可能与令人放松的企鹅及其易驯服的特性有关)。但让我们开始思考为什么这种方法,即贝叶斯最优分类器,在计算上很快就变得不可能了。

All this must still seem straightforward (and some of the ease might have to do with the disarming penguins and their tractable characteristics). But let’s start thinking about why this method, the Bayes optimal classifier, gets computationally impossible very quickly.

这种方法的核心是能够根据一组特征估计概率分布。对于一个特征,即喙的深度,我们必须估计二维函数的形状。也许一组大约一百只特定类别的企鹅就足以很好地了解底层函数。然后,我们将特征集增加到两个(喙的长度和深度),并且必须估计三维表面的形状。相同的样本量,每个类别一百只企鹅,可能仍然足以获得该三维表面。但是随着我们添加更多特征,样本量就成了一个问题。

At the heart of this approach is the ability to estimate probability distributions given a set of features. For one feature, bill depth, we had to estimate the shape of a 2D function. Maybe a set of one hundred or so penguins of a certain class would be enough to get a good sense of the underlying function. We then bumped up the set of features to two (bill length and depth) and had to estimate the shape of a 3D surface. The same sample size, one hundred penguins per class, may still be adequate to get at that 3D surface. But the sample size becomes an issue as we add more features.

在实际的机器学习问题中,特征的数量可能是数十、数百、数千甚至更多。这就是问题的范围变得令人恐惧的地方。在越来越高的维度上以合理的精度估计概率分布的形状,将需要越来越多的数据。几百个样本是不够的。而且,你添加的数据越多,估计分布所需的计算量就越大,如果这种估计还可能做到的话。

In real-life ML problems, the features can number in the tens, hundreds, thousands, or even more. This is where the scope of the problem becomes terrifying. Estimating the shape of the probability distribution with reasonable accuracy in higher and higher dimensions is going to require more and more data. A few hundred samples are not going to cut it. And the more data you add, the more compute-intensive estimating the distribution gets, if it’s possible at all.

因此,需要进行一些简化。

So, some simplifications are in order.

天真是值得的

IT PAYS TO BE NAÏVE

让我们重新表述一下我们的问题。如果我们有五个特征用来描述企鹅(喙深度、喙长度、鳍长度、体重和性别),那么我们会发现每只企鹅本质上都是 5D 空间中的一个向量(一个点)。特征向量x是:

Let’s restate our problem. If we have, say, five features that are being used to describe a penguin (bill depth, bill length, flipper length, body mass, and sex), then we see that each penguin is essentially a vector (a point) in 5D space. The feature vector x is:

[ x1 , x2 , x3 , x4 , x5 ]

[x1, x2, x3, x4, x5]

如果我们获得了一些尚未确定的企鹅的这些属性,那么我们的工作就是找出:

If we are given these attributes for some as-yet-unidentified penguin, then our job is to figure out:

  • P ( y = Adélie | x ):给定证据或特征向量x ,企鹅是阿德利企鹅的概率;
  • P (y = Adélie | x): the probability the penguin is an Adélie, given the evidence, or feature vector, x;
  • P ( y = Gentoo | x ):给定x,企鹅是 Gentoo 的概率;以及
  • P (y = Gentoo | x): the probability the penguin is a Gentoo, given x; and
  • P ( y = Chinstrap | x ):给定x,企鹅为 Chinstrap 的概率。
  • P (y = Chinstrap | x): the probability the penguin is a Chinstrap, given x.

采取以下任意一个计算:

Take any one of these computations:

P ( y = 阿德利 | x ) = P ( y = 阿德利 | x1 , x2 , x3 , x4 , x5 )
P (y = Adélie | x) = P (y = Adélie | x1, x2, x3, x4, x5)

计算该值首先需要估计类条件概率密度函数 P ( x | y = Adélie):

Computing this will require first estimating the class-conditional probability density function P (x | y = Adélie):

P ( x | y = 阿德利) = P ( x1 , x2 , x3 , x4 , x5 | y = 阿德利)
P (x | y = Adélie) = P (x1, x2, x3, x4, x5 | y = Adélie)

这是一个复杂的六维曲面,取决于所有五个特征。我们已经确定,在数据样本有限和计算资源受限的情况下,重建或估计这个曲面几乎是不可能的。

This is a complicated surface in six dimensions, which depends on all five features. We have already established that reconstructing or estimating this is near impossible with limited samples of data and constrained computational resources.

这是统计学家和概率论者用来使问题更易于处理的一个技巧。他们假设所有特征都是从各自的分布中相互独立地采样的。因此,阿德利企鹅喙深度的值是从喙深度的底层分布中独立采样的值,而喙长度的值是从喙长度的底层分布中独立采样的值。这意味着,比如说,喙深度的变化与喙长度的变化无关。当然,这在自然界中永远不会真正成立。但在简化数学方面,这个假设非常有效。鉴于这种特征相互独立的假设,我们可以使用贝叶斯定理来得到我们想要的结果:

Here’s a trick that statisticians and probability theorists use to make the problem more tractable. They assume that all features are sampled from their own distributions independently of one another. So, the values for bill depth in Adélie penguins are values sampled independently from an underlying distribution for bill depth alone, and the values for bill length are values sampled independently from an underlying distribution for bill length alone. This implies that the variation in, say, bill depth has nothing to do with variation in bill length. This is, of course, never really going to be true in nature. But it’s an assumption that works wonders when it comes to making the mathematics easier. Given this assumption of mutually independent features, we can use Bayes’s theorem to get at what we desire:

我们需要估计或先验知道的函数以及在高维中引起问题的函数是:

The function that we need to estimate or know a priori, and the one that causes problems in higher dimensions, is:

P( x | y = 阿德利) = P( x1, x2, x3, x4, x5 | y = 阿德利)

P(x | y = Adélie) = P(x1, x2, x3, x4, x5 | y = Adélie)

相互独立的假设使任务变得更简单。根据该假设(并使用 A 代表 Adélie):

The mutual independence assumption makes the task simpler. Given that assumption (and using A for Adélie):

P( x1, x2, x3, x4, x5 | y = A)

P(x1, x2, x3, x4, x5 | y = A)

= P( x1 | y = A) × P( x2 | y = A) × P( x3 | y = A) × P( x4 | y = A) × P( x5 | y = A)

= P(x1 | y = A) × P(x2 | y = A) × P(x3 | y = A) × P(x4 | y = A) × P(x5 | y = A)

这个问题被分解成许多子问题,每个子问题都涉及估计一个特征或随机变量的概率分布。这可以用更少的样本来完成,而且计算量要小得多。使用更紧凑的数学符号,我们有:

The problem has been broken down into many sub-problems, each involving the estimation of a probability distribution for just one feature, or random variable. This can be done with fewer samples and is computationally far less intensive. Using more compact mathematical symbols, we have:

P( x | y = A) = ∏ P( xi | y = A)

“pi” 符号表示乘法;它相当于加法的“sigma”符号。有了这个简化,我们可以计算各种类条件概率(给定 5D 证据x ,企鹅是阿德利企鹅;给定x ,企鹅是巴布亚企鹅;给定x,企鹅是帽带企鹅),然后基本上根据最高条件概率预测企鹅种类。这种假设特征相互独立的分类器称为朴素贝叶斯分类器,或者有些贬义地称为白痴贝叶斯分类器。但它是一种强大的技术,在许多情况下都很有效。例如,这种分类器可以很好地预测电子邮件是否为垃圾邮件。

The “pi” symbol signifies multiplication; it’s multiplication’s equivalent of the “sigma” symbol for addition. With this simplification in hand, we can compute the various class-conditional probabilities (the penguin is an Adélie given the 5D evidence, x; the penguin is a Gentoo given x; and a Chinstrap given x), and then basically predict the type of penguin based on the highest conditional probability. Such a classifier, with the assumption of mutually independent features, is called a naïve Bayes or, somewhat pejoratively, an idiot Bayes classifier. But it’s a powerful technique that works well in many situations. For example, such a classifier does well in predicting whether an email is spam.
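The factored product above can be sketched directly in code. Below, each feature gets its own 1D Gaussian per class; all the means, standard deviations, and priors are invented stand-ins (the text gives no Chinstrap counts), so this shows the shape of a naïve Bayes classifier, not a fit to the real data.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical per-feature Gaussians, (mean, std), for two features:
# bill depth and bill length. A real model would estimate these, and
# the priors, from labeled data; all numbers here are stand-ins.
class_params = {
    "Adelie":    [(18.3, 1.2), (38.8, 2.7)],
    "Chinstrap": [(18.4, 1.1), (48.8, 3.3)],
}
priors = {"Adelie": 0.68, "Chinstrap": 0.32}

def naive_bayes(x):
    """Score each class as prior × ∏ P(x_i | class); independence is the naïve part."""
    scores = {}
    for cls, feats in class_params.items():
        score = priors[cls]
        for xi, (mu, sd) in zip(x, feats):
            score *= normal_pdf(xi, mu, sd)   # one 1D density per feature
        scores[cls] = score
    return max(scores, key=scores.get)

print(naive_bayes([18.5, 39.0]))   # bill length typical of an Adélie
print(naive_bayes([18.5, 49.0]))   # bill length typical of a Chinstrap
```

Note how the overlapping bill depths (18.5 mm fits both classes) contribute almost nothing, while the well-separated bill lengths drive the decision; the product lets each feature vote independently.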

现在,你可能会想知道,甚至担心,即使这些特征被视为相互独立,我们究竟如何估计概率分布。好吧,我们已经见过一个这样的例子:当时我们为给定企鹅类型的喙深度直方图拟合了一条曲线。让我们重新回顾一下:

By now, you may be wondering, even worried about, how exactly we estimate the probability distributions, even if the features are taken to be mutually independent. Well, we have already seen an example of this, when we fit a curve over the histogram of bill depths for a given penguin type. Let’s revisit it:

我们假设底层分布是高斯分布(即正态分布),然后基本上使用采样的巴布亚企鹅数据(其喙深度)来找到一条最适合数据的曲线,其均值和标准差构成了分布的参数θ。一旦我们有了θ,我们就有了底层分布的估计值,不再需要更多的数据。在这种情况下,我们将随机变量建模为连续变量。我们对每个特征都这样做,并使用每个概率密度函数进行计算。

We assume that the underlying distribution is a Gaussian, or normal, distribution, and we basically use the sampled Gentoo penguin data (of their bill depths) to find a curve with a mean and standard deviation—these constitute the parameter θ of the distribution—that best fits the data. Once we have θ, we have an estimate of the underlying distribution, and we don’t need any more data. In this case, we have modeled our random variable as continuous. We do this for each feature and use each probability density function for the calculations.
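Fitting that curve is simpler than it may sound: for a Gaussian, the maximum likelihood estimates of the parameters are just the sample mean and the (biased, divide-by-n) sample variance. The bill depths below are invented for illustration.

```python
import math

# Minimal maximum likelihood fit of a Gaussian to 1D data.
# The bill-depth samples (mm) are made up for illustration.
bill_depths = [14.2, 15.1, 15.8, 14.9, 15.3, 16.0, 14.7, 15.5]

n = len(bill_depths)
mu_hat = sum(bill_depths) / n                                    # MLE of the mean
var_hat = sum((x - mu_hat) ** 2 for x in bill_depths) / n        # MLE divides by n, not n-1
sigma_hat = math.sqrt(var_hat)

print(mu_hat, sigma_hat)   # these two numbers are θ, the fitted parameters
```

Once θ = (mu_hat, sigma_hat) is in hand, the raw samples can be discarded: the two numbers summarize the whole estimated distribution.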

或者我们也可以简单地将每个特征建模为离散随机变量。下图中有十个箱子。每个箱子中的企鹅喙的深度都在一定范围内。

Or we can also simply model each feature as a discrete random variable. In the figure below, there are ten bins. Each bin has penguins with bill depths that fall within a certain range.

例如,在总共 119 只巴布亚企鹅中,第四个箱子中有 19 只。因此,喙深度落在第四个箱子中的概率为 19/119 = 0.16。同样,喙深度落在第八个箱子中的概率为 11/119 = 0.09。依此类推……

For example, there are 19 Gentoo penguins in the fourth bin, out of a total of 119. So, the probability that the bill depth falls in the fourth bin is 19/119 = 0.16. Similarly, the probability that the bill depth falls in the eighth bin is 11/119 = 0.09. And so on…
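The bin arithmetic above is a probability mass function in miniature. In the sketch below, the fourth and eighth bin counts (19 and 11) come from the text; the other eight counts are made up so that the bins total the 119 Gentoos in the sample.

```python
# Estimating a probability mass function by binning, as in the histogram.
# Counts for bins 4 and 8 are from the text; the rest are invented so the
# ten bins sum to the 119 Gentoo penguins in the sample.
bin_counts = [2, 6, 13, 19, 25, 21, 14, 11, 6, 2]
total = sum(bin_counts)                          # 119 penguins

pmf = [count / total for count in bin_counts]    # one probability per bin
print(round(pmf[3], 2))   # fourth bin: 19/119 = 0.16
print(round(pmf[7], 2))   # eighth bin: 11/119 = 0.09
```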

正如我们之前所看到的,计算离散随机变量的这类概率,就得到了概率质量函数。我们可以使用这些概率进行预测。显然,样本越多,直方图就越能代表真实的潜在分布,从而使预测更加准确。

Calculating such probabilities for the discrete random variable, we saw earlier, gives us the probability mass function. We can use these probabilities to make our predictions. It’s obvious that the more samples we have, the better the histogram will be in representing the true underlying distribution and, thus, will make the predictions more accurate.

总结

WRAP-UP

如果对概率和统计学的这番探索让你感到有些不知所措,这是可以理解的,特别是如果你像我当初一样,对这个主题知之甚少。我们试图在一章的篇幅内应对两个庞大的数学领域,并将它们与机器学习联系起来。如果这一切不让人觉得有点吃不消,我反而会感到惊讶。即使其他一切都变得模糊,我们也可以从这一章中带走几条简洁的概念性信息。

It’s understandable if this foray into probability and statistics felt a bit overwhelming, especially if you came to it—as I once did—with little prior knowledge of the subject. We have tried to grapple with two massive fields of mathematics, all in the space of a chapter, and connect them to machine learning. I’d be surprised if it didn’t feel a touch too much. Even if everything else becomes a blur, we can take away from this chapter a few succinct conceptual messages.

在监督式机器学习中,所有数据都是从底层分布中提取(或采样)的。D是我们的数据。D的一部分是特征向量矩阵X。每行代表数据的一个实例(例如,一只企鹅的特征x)。数据D还为X的每一行(例如,企鹅的种类)提供了相应的标签。这些标签形成一个列向量y。数据D被认为是从底层分布 P ( X, y ) 中采样的。因此:

In supervised machine learning, all data is drawn (or sampled) from an underlying distribution. D is our data. One part of D is a matrix of feature vectors, X. Each row represents one instance of the data (say, features for one penguin, x). The data D also has a corresponding label for each row of X (say, the species of penguin). These labels form a column vector y. The data D is said to be sampled from the underlying distribution P (X, y). So:

D ~ P ( X, y )

D ~ P(X, y)

总体基础分布的真实性质几乎总是隐藏在我们面前。许多 ML 算法的任务是尽可能隐式或显式地估计此分布,然后使用该分布对新数据进行预测。

The true nature of the overall underlying distribution is almost always hidden from us. The task of many ML algorithms is to estimate this distribution, implicitly or explicitly, as well as possible and then use that to make predictions about new data.

假设估计分布如下:

Let’s say the estimated distribution is given by:

Pθ ( X, y )

Pθ(X, y)

符号θ表示分布的参数。对于不同类型的分布,参数的含义不同。例如:伯努利分布需要确定一个参数“p”;正态分布需要确定平均值和标准差。(我忽略了一整类没有参数的分布,它们因此被称为非参数分布。)该过程首先对底层分布的类型做出假设——例如,它是伯努利分布、正态分布还是其他分布?——然后确定最佳的θ。

The symbol θ represents the parameters of the distribution. The parameters mean different things for different types of distributions. For example: A Bernoulli distribution requires figuring out one parameter “p”; a normal distribution requires figuring out the mean and standard deviation. (I’m ignoring a whole class of distributions that don’t have parameters and, hence, are called nonparametric.) The process starts with making an assumption about the type of underlying distribution—is it, for example, a Bernoulli or a normal distribution or something else?—and then figuring out the best θ.

广义上讲,估计θ可以用两种方法之一完成。(还有其他方法,但这两种方法让我们很好地了解了机器学习的工作原理。)第一种方法称为最大似然估计 (MLE),给定数据,尝试找到使数据似然最大化的θ。这意味着P θ ( X, y ) 将为不同的θ提供不同的概率分布,并且算法会找到使观察到我们手头数据的概率最大化的θ。

Estimating θ can be done, broadly speaking, in one of two ways. (There are other methods, but these two give us a great sense of how things are done in machine learning.) The first is called maximum likelihood estimation (MLE), which, given the data, tries to find the θ that maximizes the likelihood of the data. What that means is that Pθ(X, y) will give us different probability distributions for different θ, and the algorithm finds the θ that maximizes the probability of observing the data we have in hand.

估计概率分布的第二种方法称为最大后验 (MAP) 估计。该方法假设θ本身是一个随机变量,这意味着您可以为其指定一个概率分布。(如前所述,这是贝叶斯学说,它让频率论者大为恼火。)因此,MAP 从关于θ分布方式的初始假设开始。这也称为先验。例如,如果您正在对抛硬币进行建模,则可以先验假设硬币是公平的;或者您可以假设它有偏差。然后,给定数据和先验,MAP 会找到后验概率分布P θ ( X, y ),这样,如果您要从此分布中抽取多个数据实例,则抽样数据与原始数据匹配的概率将最大化。

The second method of estimating the probability distribution is called maximum a posteriori (MAP) estimation. This assumes that θ is itself a random variable, which means that you can specify a probability distribution for it. (As we saw earlier, it’s a Bayesian claim that makes frequentists livid.) So, MAP starts with an initial assumption about how θ is distributed. This is also called the prior. For example, if you are modeling a coin toss, you can assume a priori that the coin is fair; or you can assume that it’s biased. Then, given the data and the prior, MAP finds the posterior probability distribution Pθ(X, y), such that if you were to sample several instances of data from this distribution, the probability that the sampled data matched the original data would be maximized.
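The coin-toss case makes MAP concrete. Using a Beta prior for the coin's bias p (the standard conjugate choice for Bernoulli data, though the book does not spell this out), the posterior is again a Beta, and its mode gives the MAP estimate in closed form. The prior counts below encode a mild belief that the coin is fair.

```python
# MAP estimation of a coin's bias p with a Beta(a, b) prior.
# The Beta prior is conjugate to Bernoulli data, so the posterior is
# Beta(a + heads, b + tails), whose mode is the MAP estimate:
#   p_MAP = (a + heads - 1) / (a + b + heads + tails - 2)
# a = b = 5 encodes a mild prior belief that the coin is fair.

def map_estimate(heads, tails, a=5, b=5):
    return (a + heads - 1) / (a + b + heads + tails - 2)

# With little data, the fair-coin prior pulls the estimate toward 0.5;
# with lots of data, the data dominate and MAP approaches the MLE.
print(map_estimate(7, 3))       # 10 tosses: 11/18, well below the MLE of 0.7
print(map_estimate(700, 300))   # 1,000 tosses: very close to 0.7
```

The contrast between the two calls is the whole point of MAP: the prior matters most exactly when data are scarce.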

如果我们能够以某种方式学习或估计整个联合概率分布P θ ( X, y ),即超维空间中的某个复杂曲面,我们将拥有所有数据、特征向量和标签的模型。这使我们能够做一些非常强大的事情:通过从分布中采样来生成类似于训练数据的新数据,从而为我们提供了所谓的生成式人工智能。

If we can somehow learn or estimate the entire joint probability distribution Pθ(X, y), which is some complex surface in a hyperdimensional space, we will have a model of all the data, the feature vectors, and the labels. This enables us to do something very powerful: generate new data that resemble the training data by sampling from the distribution, giving us what has come to be called generative AI.

ML 算法还可以使用该模型对新的未标记数据进行预测。朴素贝叶斯(或傻瓜贝叶斯)分类器就是一个例子:它首先学习联合概率分布(尽管有一些简化假设),然后使用贝叶斯定理区分不同类别的数据。

ML algorithms can also use the model to make predictions about new, unlabeled data. The naïve Bayes (or idiot Bayes) classifier is an example: It first learns the joint probability distribution, albeit with some simplifying assumptions, and then uses Bayes’s theorem to discriminate between different classes of data.

有些算法避免学习复杂的联合概率分布;相反,它们专注于属于一个类或另一个类的数据的条件概率。这种方法允许人们做一些称为判别学习的事情。算法可以通过计算概率分布P θ ( y | x ) 来进行判别学习。这意味着给定一个新的特征向量x和一些最优θ,我们现在可以计算x最可能类别的概率:具有较高条件概率的类别获得认可,作为我们的 ML 算法的预测。

There are algorithms that eschew learning the complex joint probability distribution; instead, they focus on the conditional probabilities of the data belonging to one class or the other. This approach allows one to do something called discriminative learning. An algorithm can do discriminative learning by calculating the probability distribution Pθ(y|x). This means that given a new feature vector x and some optimal θ, we now can calculate the probability of the most likely class for x: The class that has the higher conditional probability gets the nod, as the prediction of our ML algorithm.

任何通过确定数据点之间的边界来区分一组数据点和另一组数据点的算法都是在进行判别学习。它不必专门处理概率分布。例如,它可能会找到一个线性超平面,就像感知器一样,或者找到一个弯曲的非线性表面或边界,我们将在下一章中看到一个例子。后一种算法的一个例子——其根源可能在于最早的人类所怀有的直觉——始于 20 世纪 60 年代的斯坦福大学(Bernard Widrow 开发 ADALINE 几年后)。它后来被称为最近邻 (NN) 算法,它展示了一种非常不同的模式识别方式。正如感知器的收敛证明让人们刮目相看一样,NN 算法也取得了同样的效果:在理想情况下,它的表现几乎与贝叶斯最优分类器一样好,而我们现在知道,贝叶斯最优分类器是最好的机器学习方法。但 NN 算法这样做时没有对数据的底层分布做出任何假设。

Any algorithm that figures out how to separate one cluster of data points from another by identifying a boundary between them is doing discriminative learning. It doesn’t have to deal specifically in probability distributions. For example, maybe it finds a linear hyperplane à la the perceptron, or finds a curved nonlinear surface or boundary, an example of which we’ll see in the next chapter. An example of this latter algorithm—whose roots probably lie in intuitions harbored by the first humans—got its start at Stanford in the 1960s (a few years after Bernard Widrow developed ADALINE). It came to be called the nearest neighbor (NN) algorithm, and it showed a very different way to do pattern recognition. And in much the way the perceptron’s convergence proof made people sit up and take notice, the NN algorithm achieved the same, doing almost as well, in ideal scenarios, as the Bayes optimal classifier, which we now know is the best ML game in town. But the NN algorithm does so without making any assumptions about the underlying distribution of the data.
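The core idea of nearest neighbor fits in a few lines: label a new point with the class of its closest training point, with no distributional assumptions at all. The training points below (bill depth, bill length) are invented for illustration.

```python
import math

# A minimal 1-nearest-neighbor classifier. The labeled training points
# (bill depth, bill length) are made up for illustration.
train = [
    ((18.3, 38.8), "Adelie"),
    ((18.7, 39.5), "Adelie"),
    ((18.4, 48.8), "Chinstrap"),
    ((19.0, 50.1), "Chinstrap"),
]

def nearest_neighbor(x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    return min(train, key=lambda point: math.dist(x, point[0]))[1]

print(nearest_neighbor((18.5, 40.0)))   # closest to the Adélie cluster
print(nearest_neighbor((18.5, 49.0)))   # closest to the Chinstrap cluster
```

Note that nothing here estimates a distribution or fits a parameter θ; the training data themselves are the model, which is what makes the method's near-Bayes-optimal behavior (in ideal scenarios) so striking.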

第五章

CHAPTER 5

志趣相投的人

Birds of a Feather

“霍乱地区没有一条街道没有死亡。”这只是霍乱调查委员会 1855 年 7 月提交的一份报告中众多严峻、发人深省的句子之一,该报告讲述了前一年袭击伦敦一个教区的一场特别严重的霍乱疫情。疫情集中在伦敦西区的苏豪区,该区域后来被称为“霍乱区”。“在该地区的中心地带 Broad Street,死亡人数超过 10%,即每 10,000 人中就有 1,000 人死亡。在 Cambridge Street、Pulteney Court 和 Kemp's Court,人口也大幅减少,”委员会成员写道。

“No street in the Cholera area was without death.” This is just one of the many stark, sobering sentences found in a report submitted by the Cholera Inquiry Committee in July 1855, about a particularly severe outbreak of the disease that had struck a London parish the previous year. The outbreak was concentrated in what came to be called the “Cholera area” of Soho, in the city’s West End. “In Broad Street, the very heart of the area, the deaths were rather more than 10 per cent, or 1,000 to every 10,000 persons living. In Cambridge Street, Pulteney Court, and Kemp’s Court, the population was also decimated,” the members of the committee wrote.

其中一名委员会成员是一位名叫约翰·斯诺的医生,他为医学科学的两个领域——麻醉学和流行病学——做出了重大贡献。如今,麻醉师们崇敬斯诺,因为他对乙醚和氯仿进行了科学研究。他的方法使他能够“在利奥波德王子和比阿特丽斯公主出生时为维多利亚女王施用氯仿”,从而使“产科麻醉克服宗教、伦理和医学上的成见而被接受”。流行病学家则赞扬斯诺对 1854 年霍乱疫情的出色分析。他能够证明疫情集中在 Broad Street 的一个水泵周围,这为他的“霍乱是一种水传播疾病”的假设提供了依据。斯诺的流行病学实地调查迫使当局检查了水泵及其周围环境,最终他们发现,距离 Broad Street 水井不到几英尺的地方有一个砖砌结构腐烂的化粪池,污水从化粪池中渗漏到水井周围的土壤中。水泵就是从那口井中抽水的。

One of the committee members was a physician named John Snow, who had been making major contributions to two fields of medical science, anesthesiology and epidemiology. Today, anesthesiologists revere Snow for his scientific study of ether and chloroform. His methods allowed him to administer “chloroform to Queen Victoria for the births of Prince Leopold and Princess Beatrice,” bringing “obstetric anesthesia into acceptance against religious, ethical, and medical beliefs.” And epidemiologists celebrate Snow for his extraordinary analysis of the 1854 cholera outbreak. He was able to show that the outbreak was clustered around a water pump on Broad Street, which lent credence to his hypothesis that cholera was a waterborne disease. Snow’s epidemiological legwork forced authorities to inspect the pump and its surroundings, and they eventually established that a cesspool with decaying brickwork, located within a few feet of the Broad Street well, was leaking wastewater into the soil around the well. The pump was drawing water from that well.

The Cholera Inquiry Committee report contains a section written by Snow, including an annotated map of Soho’s “Cholera area” that’s considered a classic by epidemiologists. But more recently, the map caught the attention of computer scientists, because the technique Snow used is a striking illustration of something that forms the conceptual heart of a popular and powerful ML algorithm.

Snow’s map had a few key elements. First was a dotted line that circumscribed a region in Soho; all deaths due to cholera during six weeks in August and September 1854 occurred in this region. Each death was denoted as a small black rectangle, marking the address of the house in which the person had died or contracted the disease. (Some houses had multiple such markers.) Water pumps were drawn as small black dots. Most important, Snow drew an inner black dotted line that showed “the various points which have been found by careful measurement to be at an equal distance by the nearest road from the pump in Broad Street and the surrounding pumps.” In other words, if you stood anywhere on this inner dotted line, you’d be equidistant from the location of the disease-ridden Broad Street pump and some other pump in Soho, as long as you walked along streets and roads to get to the pumps (so, not as the crow flies). When inside the perimeter of this latter dotted line, people were closer to the Broad Street pump; outside, they were closer to some other pump.

After making allowances for one confounding pump, Snow concluded, “It will be observed that the deaths very much diminish, or cease altogether, at every point where it becomes decidedly nearer to send to another pump than to the one in Broad Street.” People who were nearer to some other pump and who went there for water, rather than to the Broad Street pump, were better off. The Broad Street pump was the problem.

Snow’s inner dotted line represents the contours of what in modern parlance is called a Voronoi cell, after the Ukrainian mathematician Georgy Voronoi (or Voronoy, as the name is also spelled), who developed the formal mathematics for the eponymous diagrams decades after Snow’s analysis. Think of the 2D map of Soho, with a sprinkling of water pumps, each pump represented by a small black dot. You can draw a Voronoi cell around each dot, or “seed,” such that any point inside the cell is closer to its seed than to any other seed. So, for a Voronoi diagram drawn for a mid-nineteenth-century map of Soho, each cell has a water pump as its seed, and every location within each cell is closer to its “seed” water pump than to any other pump. There are different ways to measure closeness from someplace inside the cell to the pumps. Normally, you’d use straight lines, an “as the crow flies” measure of distance. But in this case, Snow had to be smarter. The closest pump had to be discerned based on the distance covered to the pumps when one walked along the streets.

Here’s an example of a simple Voronoi diagram:

The diagram is a tessellation of irregular polygons, or cells. Think of each dot as a water pump. For now, let’s stick to the “as the crow flies” metric to measure the distance to a pump. In the Voronoi diagram above, if we are within a cell, the closest pump is the one inside the cell. If we are walking along a polygon edge, and because that edge is shared by two cells, we are equidistant from the two pumps that lie within those cells. And if we are standing at the vertex formed by the intersection of edges, we are equidistant from three (or more) pumps.
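The cell-membership test just described can be sketched in a few lines of Python. This is a hypothetical, brute-force illustration: the pump coordinates (and the two pump names other than Broad Street) are invented, and closeness is measured as the crow flies.

```python
import math

# Invented (x, y) positions for three pumps; only "Broad Street" is a real name.
pumps = {
    "Broad Street": (3.0, 4.0),
    "Pump B": (6.0, 1.0),
    "Pump C": (1.0, 1.0),
}

def nearest_pump(location, pumps):
    """Return the pump whose Voronoi cell contains `location` (Euclidean metric)."""
    return min(pumps, key=lambda name: math.dist(location, pumps[name]))

# A house at (3, 3) falls inside the Broad Street pump's cell:
print(nearest_pump((3.0, 3.0), pumps))  # Broad Street
```

A point exactly equidistant from two pumps sits on a cell edge; `min` would then break the tie arbitrarily, just as a location on an edge or vertex of the diagram belongs to two or more cells at once.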

What does this have to do with machines that learn? Quite a lot. Let’s start with a hypothetical problem. Imagine Midtown Manhattan, with its mostly neat grid of streets and avenues.

Let’s say the U.S. Postal Service installs six brand-spanking-new branches in this neighborhood. We are given the task of assigning each building in Midtown to the nearest new post office branch. (Ignore the fact that buildings have postcodes; using such codes makes the task trivial.) How would we do it? We could come up with a Voronoi diagram for all of Midtown on the xy plane (substituting post office branches for water pumps). In the map shown above, the point (0, 0) is the bottom-left corner, at the intersection of Twenty-Third Street and Eleventh Avenue. Each post office branch gets an (x, y) coordinate relative to (0, 0), based on which we can construct a Voronoi diagram. Once we have the Voronoi diagram, the assignment of buildings to post office branches becomes straightforward. If a given building lies within a Voronoi cell, it’s served by the post office branch that is the seed for that cell. If the building is on the edge of two cells or on a vertex, it’s equidistant from multiple post office branches, so we assign it to the branch that’s handling the least number of buildings.

But what’s the correct notion of distance from a building to a post office? Let’s assume we use the “as the crow flies” measure of distance, also called the Euclidean distance (after the Greek mathematician Euclid). If the post office branch is at coordinate (x1, y1) and the apartment building is at (x2, y2), then the Euclidean distance is given by:

√((x2 - x1)² + (y2 - y1)²)

This is simply the length of the hypotenuse formed by a right-angled triangle, with the coordinates (x1, y1) and (x2, y2) forming two vertices of the triangle. The triangle has two sides that are perpendicular to each other. One side, along the x-direction, has the length (x2-x1) and the other, in the y-direction, has the length (y2-y1). It’s trivial to calculate the length of the hypotenuse (see figure opposite).

But it turns out that this is not a good measure of distance in a place like Midtown Manhattan. People aren’t crows, and they want the distance to the nearest post office branch measured in terms of the distance traveled by sidewalk/street. Well, there’s a simple measure of such a distance:

|x2 - x1| + |y2 - y1|

Because of Midtown Manhattan’s grid (ignoring Broadway, which cuts across the grid), this is the same as walking |x2 - x1| units of length along streets and |y2 - y1| units of length along avenues, even if you have to zigzag your way down different streets and avenues. No wonder this measure of distance is formally called the Manhattan distance between two points. Now, if someone constructs a new building on vacant land, that’s a new data point. It’s simple to figure out that building’s post office branch: Find the building’s Voronoi cell and assign that building to the cell’s seed, or post office branch.
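The two distance measures compare like this in code; a minimal sketch with arbitrary coordinates:

```python
import math

def euclidean(p, q):
    """'As the crow flies' distance: the hypotenuse of the right triangle."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    """Distance walked along a street grid: |x2 - x1| + |y2 - y1|."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

# Three blocks east and four blocks north of the origin:
print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

The crow flies five blocks; the pedestrian, confined to the grid, walks seven.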

The problem we’ve just analyzed is more generally cast as the search for nearest neighbors. Software implementations of such searches rank among the most influential algorithms in machine learning. We’ll soon see why. But first, we must go back in time to the Islamic Golden Age and the work of Abu Ali al-Hasan Ibn al-Haytham, or Alhazen, a Muslim Arab mathematician, astronomer, and physicist. It was Alhazen who, in his attempt to explain visual perception, came up with a technique that closely mirrors modern nearest neighbor search algorithms. Marcello Pelillo, a computer scientist at the University of Venice, Italy, has been doing his best to draw attention to Alhazen’s ideas.

THE MAKINGS OF AN ALGORITHM

One day, when he wandered into a bookstore in New Haven, Connecticut, Pelillo stumbled upon a slim book called Theories of Vision from Al-Kindi to Kepler. It was the late 1990s, and Pelillo was then a visiting professor at Yale. Besides doing research in computer vision, pattern recognition, and machine learning, he had a penchant for the history and philosophy of science and a love of math. The slim book, at just over two hundred pages, was alluring. It argued that Alhazen was “the most significant figure in the history of optics between antiquity and the seventeenth century.” Before Alhazen, humanity’s attempts at understanding vision—our ability to see and perceive the world around us—were, in hindsight, very strange. One idea was known as the “intromission” theory, which essentially posited that we see an object because bits of matter of some form emanate from that object and enter our eyes: “Material replicas issue in all directions from visible bodies and enter the eye of an observer to produce visual sensation.” Some believed that those bits of matter were atoms. “The essential feature of this theory is that the atoms streaming in various directions from a particular object form coherent units—films or simulacra—which communicate the shape and color of the object to the soul of an observer; encountering the simulacrum of an object is, as far as the soul is concerned, equivalent to encountering the object itself.”

Less influential by this time was an alternative idea, under the rubric of “extramission” theories, which argued that our eyes emanate rays that intercept objects, allowing us to see those objects. Euclid, back in about 300 B.C., gave the first entirely geometrical account, using seven theorems, of how such rays could explain aspects of our vision. Then there were those who tried, unsuccessfully, to combine the intromission and extramission theories into a cohesive explanation of vision.

All this was patently wrong, but, again, only in hindsight. Alhazen came up with an alternative theory that put paid to these ideas. He proposed a new type of intromission, one that relied not on bits of matter leaving an object and entering our eyes but, rather, one that required light radiating out in straight lines from every point of a colored object. Some of that light would enter our eyes and lead to perception. We don’t have to concern ourselves with Alhazen’s amazing analysis of the eye’s anatomy or the optics of it all. Suffice it to say his work was instrumental in “transforming the intromission theory into a mathematical theory of vision.”

Marcello Pelillo was most interested in Alhazen’s account of what happens once the light and color are registered in the eye: the act of recognizing what the eye is seeing. Alhazen wrote, “When sight perceives some visible object, the faculty of discrimination immediately seeks its counterpart among the forms persisting in the imagination, and when it finds some form in the imagination that is like the form of that visible object, it will recognize that visible object and will perceive what kind of thing it is.”

Basically, Alhazen argued that once a visible object registered in the eye, some cognitive process—the “faculty of discrimination”—compared what was being seen to what had already been seen and categorized in one’s imagination, or memory. So, if what’s being seen is a dog, that image is recognized as such when cognition connects it to some stored memory of a dog.

Alhazen even had a solution for when no such comparison was possible because the object had never been seen before. “If it does not find a form similar to the form of that visible object among the forms persisting in the imagination, it will not recognize that visible object or perceive what kind of thing it is,” he wrote.

Note that Alhazen talked of one form being “similar” to another, but he didn’t say what “similarity” meant. In computer science, similarity has to do, for example, with the distance of one data point from another in some hyperdimensional space, be it Euclidean, Manhattan, or some other measure of distance. (The closer that two data points are to each other according to some given measure, the more similar they are.) We’ll get back to these issues soon enough.

When Pelillo looked at Alhazen’s writing, it became clear to him that Alhazen’s method was “a surprisingly clear, almost algorithmic, exposition” of an idea that was formally invented in the 1950s and analyzed mathematically in the ’60s, thanks mainly to Thomas Cover, a young, whip-smart information theorist and electrical engineer at Stanford, and Cover’s precocious graduate student Peter Hart. Their algorithm came to be called the nearest neighbor (NN) rule; it became an extremely important algorithm for pattern recognition, which classifies data as belonging to one category or another. (Is what’s being seen a dog or a cat?)

“I don’t know whether [Alhazen’s work] was the very first time that that idea was put forward, namely, that in order for me to recognize an object, I have to compare that object with something that I have in my memory, and then I just look for the closest one according to a certain notion of similarity,” Pelillo told me. “[That] is exactly the nearest neighbor rule. My conjecture is that probably it was the first time.”

PATTERNS, VECTORS, AND NEIGHBORS

Alhazen had clearly envisioned aspects of Cover and Hart’s NN algorithm almost a thousand years before them, but Peter Hart thinks the intuition goes back farther, probably to our cave-dwelling forebears. To make his point, he starts with some simple mathematics, of the kind we have already encountered. Think of 2D vectors and the xy plane. Each point in the xy plane, given by some pair (x, y), is a vector, an arrow drawn from the origin (0, 0) to (x, y). The same is true of the 3D xyz coordinate system: Each vector is an arrow drawn from (0, 0, 0) to (x, y, z), and a 3D vector is simply a point in 3D space. This can be extended to any number of dimensions. We soon run out of letters of the alphabet, so it’s customary to write a d-dimensional vector as [x1, x2, x3, …, xd]. The key now is to connect vectors to patterns.

Let’s say we are looking at a 7×9 image, a total of 63 pixels, where each pixel can be either white (0) or black (1). You can easily depict the digits 0 to 9 using such an image by making some pixels black and others white. And each image, or pattern, can be written as a vector, [x1, x2, …, x63]: a set of 63 numbers, where each number is either a 0 or a 1. We have simply transformed each 7×9 image into a 63-dimensional vector.
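As a sketch, flattening such a grid into a vector takes one line of Python. The grid below is an arbitrary, blocky stand-in for a hand-drawn 8, not data from any real dataset:

```python
# A 7-wide-by-9-tall black-and-white grid: 1 = black pixel, 0 = white.
# The doodle is a crude "8" for illustration.
grid = [
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

# Flatten row by row into a single 63-dimensional vector.
vector = [pixel for row in grid for pixel in row]
print(len(vector))  # 63
```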

Now, what if you were shown a blank 7×9 grid on the touch screen of your tablet and were asked to draw the numeral 2 or the numeral 8 using your finger? Each time you draw the numeral, some squares in the grid turn black, while others remain white. The pattern then gets stored as a number that is 63 bits long. You do that a few times and then pass the tablet to someone else, and they do the same. Many people take turns at generating the data, and soon, you have hundreds of samples of vectors representing two hand-drawn digits, 2 and 8. Each pattern, mathematically, is a point in the 63-dimensional vector space. Ask yourself this: Given that each vector is a point (which is how machine learning thinks of vectors), how would these vectors be clustered in that hyperdimensional space?

Well, most of us will draw the numeral 2 in much the same way, but each effort might be slightly different from another, even if the drawings were made by the same person. (Because we are using black and white pixels, the difference may not be significant between drawings, but it would be a lot more if, say, finger pressure converted each pixel into something on the grayscale and gave it a number from 0 to 255. But let’s go with 0 and 1 for white and black.) So, each time the pattern 2 is stored as a set of 63 numbers, it represents a vector, a point that’s more or less in the same location as the vectors representing other hand-drawn 2s. Now, the points representing all the hand-drawn 2s will be clustered near one another in the 63-dimensional space. The same goes for the numeral 8. But the two numerals are different enough that the vectors constituting the cluster for the numeral 2 will be in one region of the 63D space, and the cluster of vectors for the numeral 8 will be somewhere else.

Assume that the touch screen also generates a label for each pattern, tagging it as either 2 or 8. What we have done now is generate a sample dataset where each vector has an associated label. The question now for us—or, rather, for any ML algorithm—is this: Given a new unlabeled, or untagged, pattern, can the algorithm tell whether it’s a 2 or an 8?

Algorithmically, there’s something very simple one can do. Just plot the new, unlabeled vector as a point. Find the point that’s nearest to it in the 63D space. If the nearest point has the label 2, then the new point is most likely a 2, too. If the nearest neighbor is an 8, the new entrant is most likely an 8. “That’s the nearest neighbor rule. That’s the caveman intuition: If they look alike, they probably are alike,” Hart told me.
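In code, the nearest neighbor rule is just a search for the minimum distance. Here is a minimal sketch; the four training vectors are made-up 2D stand-ins for the 63-dimensional digit vectors, each tagged "2" or "8":

```python
import math

# Hypothetical labeled training data: (vector, label) pairs.
training = [
    ((0.0, 0.0), "2"),
    ((0.5, 0.4), "2"),
    ((5.0, 5.0), "8"),
    ((5.2, 4.6), "8"),
]

def classify_1nn(point, training):
    """Label a new point with the label of its single nearest neighbor."""
    _, nearest_label = min(training, key=lambda item: math.dist(point, item[0]))
    return nearest_label

print(classify_1nn((0.3, 0.2), training))  # 2
print(classify_1nn((4.8, 5.1), training))  # 8
```

The same function works unchanged on 63-dimensional vectors; `math.dist` accepts points of any dimension.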

Cave dwellers and Alhazen’s seminal work notwithstanding, the first mathematical mention of the nearest neighbor rule appeared in a 1951 technical report of the U.S. Air Force School of Aviation Medicine, Randolph Field, Texas. The authors were Evelyn Fix and Joseph L. Hodges, Jr. In 1940, Fix came to work at the University of California, Berkeley, as a research assistant in the Statistical Laboratory, assigned to a project for the National Defense Research Committee. U.S. researchers were getting drawn into the war raging in Europe. “The war years were hard,” Hodges and others wrote in a memorial tribute to Fix.

They brought difficult, but occasionally interesting problems. Every demand on the Laboratory emphasized urgency; the practical solutions of the problems meant the determination of optimal plans, the probability of success of this plan, of that plan, numbers and more numbers. High speed computers did not exist; all the numerical work had to be done on desk calculators consuming much time and effort. Endowed with unusual energy and with a special spirit of getting the job done and done right, Evelyn [Fix] spent days and nights at her machine, aided by a group of students and some faculty wives, so that the needed results could be transmitted on time, usually to New York but occasionally directly to England. During these years Evelyn also continued her own studies and lectured to students.

Fix’s efforts gave her valuable expertise in the practical use of statistics and probability theory. She got her Ph.D. in 1948 and stayed on at UC Berkeley, enjoying many productive collaborations. One of them was with Joseph Hodges, which resulted in the technical report of 1951. The very last equation of this extremely short but important paper refers to the rule we elucidated earlier: Given a set of labeled, or tagged, data points, a new, unlabeled data point can be assigned the same label, or tag, as that of its nearest neighbor in the hyperdimensional vector space.

As a graduate student looking for a doctoral thesis topic related to pattern recognition, Peter Hart stumbled upon the Fix and Hodges paper and the nearest neighbor rule. He was intrigued and wanted to figure out its theoretical properties. “The most practical thing in the world is a good theory,” Hart told me. “If you know the theoretical properties of a procedure, you can have confidence employing that without having the bother of conducting endless experiments to figure out what it does or when it works and when it doesn’t work.”

Hart approached Thomas Cover, who had just joined Stanford as an acting assistant professor, and the two discussed working on the theoretical properties of the nearest neighbor rule. Their two-hour conversation left Hart convinced he wanted Cover as his thesis advisor; but there was an administrative hitch. Because he was only an “acting” assistant professor, Cover couldn’t formally be the first reader on a dissertation. Hart was prescient, though. “Even back then I was a pretty good judge of talent,” he told me. “I thought, you know, by the time I need a signature, I bet he’ll get promoted all the way up to assistant professor.” Hart signed on with Cover as the young academic’s very first graduate student and began working on understanding the theory behind the nearest neighbor rule. His work established the lower and upper bounds for the algorithm: the good and the bad. The good was almost as good as the best possible solution, if you assumed that the number of data samples went to infinity. The bad wasn’t that bad at all. Crucially, the strength of the nearest neighbor rule was that it didn’t make any assumptions about the underlying data distribution.

IT DOESN’T GET SIMPLER

ML algorithms don’t get much simpler than the nearest neighbor rule for classifying data. Especially considering the algorithm’s powerful abilities. Let’s start with a mock dataset of gray circles and black triangles.

Recall the perceptron algorithm. It will fail to tell apart the circles from the triangles, because this dataset is not linearly separable: There’s no single straight line you can draw to delineate the two classes of data. A naïve Bayes classifier can find a windy line that separates the circles from the triangles, though. We’ll come back to that in a bit, but for now, let’s tackle the nearest neighbor algorithm. The problem we must solve is this: When given a new data point, we have to classify it as either a circle or a triangle.

The nearest neighbor algorithm, in its simplest form, essentially plots that new data point and calculates its distance to each data point in the initial dataset, which can be thought of as the training data. (We’ll use the Euclidean distance measure for our purposes.) If the data point nearest to the new data is a black triangle, the new data is classified as a black triangle; if it’s a gray circle, the new data is classified as a gray circle. It’s as simple as that. The following two panels show how a new data point is labeled based on its nearest neighbor. (The new data point is shown as a star, but is colored either gray or black, depending on whether it’s classified as a gray circle or a black triangle.) The original dataset is the same as the one shown in the previous panel.

Going back to the perceptron algorithm, recall that the linearly separating hyperplane divides the coordinate space into two regions. The nearest neighbor algorithm does the same, except in this case, the boundary between the two regions is not a straight line (or a hyperplane in higher dimensions). Rather, it’s squiggly, nonlinear. Look at the two plots above, and you can imagine a boundary such that if the new data point fell on one side of the boundary, it’d be closer to a gray circle, or else to a black triangle. Here’s what the boundary looks like for the same dataset when the NN algorithm examines just one nearest neighbor. You can see that a new data point (a gray star) that’s closest to a gray circle lies in the region that contains all the gray circles, and one that’s closest to a black triangle lies in the region containing all the black triangles.

This simple algorithm—we’ll come to the details in a moment—achieves something quite remarkable: It finds a nonlinear boundary to separate one class of data from another. But the simplicity of the algorithm that uses just one nearest neighbor belies a serious problem. Can you figure it out before reading further?

To help understand the potential problem, consider another dataset (shown above), one that includes a data point that’s misclassified by humans as a black triangle and that lies amid the gray circles. What do you think might happen, in terms of finding the boundary separating the circles from the triangles? The machine, it must be said, would have no way of knowing that the errant black triangle had been misclassified as such.

Given the data, the algorithm will find a nonlinear boundary that’s quite intricate. Here’s the solution:

Notice how the nonlinear boundaries split up the coordinate space into more than two gray-and-white regions. There’s a small “island” surrounding the misclassified black triangle. If your new data point is within that small island, it’ll get classified as a black triangle even though it’s surrounded by gray circles.

What we have seen is an example of what ML researchers call overfitting. Our algorithm has overfit the data. It finds a boundary that doesn’t ignore even a single erroneous outlier. This happens because the algorithm is paying attention to just one nearest neighbor. There’s a simple fix, however, that addresses this problem. We can simply increase the number of nearest neighbors against which to compare the new data point. The number of neighbors must be odd (say, three or five or more). Why an odd number? Well, because if it were even, we could end up with a tie, and that’s of no use. An odd number ensures we’ll get an answer, right or wrong. This is assuming that we are working only with data that can be clustered into two classes (in this case, the gray circles and the black triangles).
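The majority-vote version of the rule can be sketched as follows. The coordinates are invented; the point to notice is how an odd k lets nearby correct labels outvote a mislabeled outlier:

```python
import math
from collections import Counter

# Hypothetical training data with one mislabeled outlier amid the circles.
training = [
    ((1.0, 1.0), "circle"),
    ((1.2, 0.8), "circle"),
    ((0.9, 1.3), "circle"),
    ((1.1, 1.1), "triangle"),  # the errant, mislabeled point
    ((5.0, 5.0), "triangle"),
    ((5.3, 4.8), "triangle"),
]

def classify_knn(point, training, k=3):
    """Label a new point by majority vote among its k nearest neighbors."""
    assert k % 2 == 1, "use an odd k to avoid ties between two classes"
    neighbors = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Right next to the outlier, 1-NN is fooled; 3-NN outvotes it:
print(classify_knn((1.1, 1.05), training, k=1))  # triangle
print(classify_knn((1.1, 1.05), training, k=3))  # circle
```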

Here’s the same dataset, but now the algorithm looks for three nearest neighbors and classifies the new data point based on the majority vote:

The nonlinear boundary no longer gives undue attention to the one lone triangle amid the circles. If a new data point were to fall near this lone triangle, it’d still be classified as a circle, because the triangle would be outvoted by the nearby circles. The boundary has become somewhat smoother; it’s not contorting to account for the noise in the data, which in our case is the misclassified triangle. Such smoother boundaries, using a larger number of nearest neighbors, are more likely to correctly classify a new data point when compared with the boundary we got using just one nearest neighbor. The algorithm is said to be generalizing better to unseen data. (Though, there’s a gray circle that’s ended up on the wrong side of the boundary. We’ll come to such transgressors in a moment.)

Here’s another example (see panel below), with the number of nearest neighbors being considered bumped up to seven. The shape of the nonlinear separating boundary looks slightly different for the same initial dataset.

Note something crucial, though. If you examine the nonlinear boundary that the algorithm achieves using seven nearest neighbors, you can see that while the errant triangle falls within the region meant for circles, its influence is largely diminished. That’s good, because the errant triangle is “noise,” a circle that was accidentally misclassified in the training data as a triangle. But there’s a circle that, unfortunately, falls inside the region meant for triangles. This circle, however, was correctly classified in the training data and yet has ended up on the wrong side of the boundary. This is the price you pay if you don’t want to overfit. The classifier—which is characterized by the boundary—can misclassify some data points in the training dataset. The reason this is desirable, despite some errors made in the training data, is that when you test this classifier using data it hasn’t yet seen, it’ll likely make fewer errors than if you used a classifier that had overfit the training data. Just keep this in mind for now; we will come to a more serious analysis of overfitting versus generalization later.

As simple as this algorithm is, the mathematics needed to determine its efficacy—in terms of the probability that the algorithm makes errors while classifying new data—was anything but simple. Peter Hart recalled struggling initially to develop the intuition needed to prove that the algorithm could converge to a result and perform satisfactorily when compared with the Bayes optimal classifier (which, as we saw, is the best a machine algorithm can do). Of course, the Bayes optimal classifier is an idealization, in that one assumes access to the underlying probability distributions of the data, or our best estimates of such distributions. The NN algorithm functions at the other extreme. All one has is data, and the algorithm makes barely any assumptions about, and indeed has little knowledge of, the underlying distributions. There’s no assumption, for example, that the data follows a Gaussian (bell-shaped) distribution with some mean and variance.

Needing to make sense of the math, Hart sought out Kai-lai Chung, a brilliant Chinese American mathematician at Stanford. Hart explained to Chung the problem he was trying to solve. Chung asked Hart if he knew of two mathematical results called Jensen’s inequality and the dominated convergence theorem. Hart said that he did. “And, with a straight face, he said, ‘Well, you know enough. Now you just have to be smarter.’ And then he waved me out of his office,” Hart told me.

And so, Hart did just that: He got smarter.

We can get a taste of the intuitions that Hart wanted to turn into rigorous results. (The actual theorems and proofs are a touch too abstruse for us.) Let’s go back to the two penguin species, Adélie and Gentoo, and one feature, bill depth (see figure, above).

If you had access to the two underlying distributions, then given a new, unclassified penguin and its bill depth, you could use Bayes’ theorem to simply calculate the probability that the penguin is an Adélie given the bill depth and the probability that the penguin is a Gentoo given the bill depth. Let’s say that, for some given bill depth, the probability of the penguin being an Adélie turns out to be 0.75 and of it being a Gentoo, 0.25. For the Bayes optimal classifier, the higher probability wins each time. The algorithm will always classify the new penguin as an Adélie, even though there is a 25 percent chance it could be wrong.

What about the nearest neighbor algorithm? Here’s a mental picture of what the 1–nearest neighbor, or 1-NN, rule accomplishes. It looks for points representing Adélie penguins and Gentoo penguins in the immediate neighborhood of the new data point. If the algorithm had access to all possible points representing the two types of penguins, then 75 percent of those nearby data points would be those of Adélie penguins and 25 percent would represent Gentoo penguins. But we have only a small sample of penguins. And the 1-NN algorithm must make its decision based on that limited dataset. It’s as if it had a biased coin that came up heads 75 percent of the time and tails 25 percent of the time. The coin, of course, is specific to a new, unclassified penguin’s bill depth: heads, it’s an Adélie; tails, it’s a Gentoo. So, unlike the Bayes optimal classifier, which will always claim it’s an Adélie, the 1-NN rule will say that the new penguin is an Adélie three-quarters of the time and a Gentoo one-quarter of the time. This intuition in hand, Hart eventually figured out the math needed to establish the lower and upper bounds of the errors made by the 1-NN rule and then extended it to k nearest neighbors, the k-NN rule. It was a seminal, memorable result. The above plot shows a comparison between probability of error (or risk) of the Bayes optimal classifier versus the NN classifier.
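The biased-coin picture lends itself to a quick simulation. The sketch below (made-up code, not Hart’s proof) draws a true label and an independent neighbor label from the same 75/25 coin: the Bayes optimal classifier, always guessing Adélie, errs 25 percent of the time, while the 1-NN rule, copying the neighbor, errs 2 × 0.75 × 0.25 = 37.5 percent of the time.

```python
import random

random.seed(0)
p = 0.75             # P(Adelie | this bill depth), as in the example
trials = 100_000

bayes_errors = 0     # Bayes optimal rule: always guess Adelie
nn_errors = 0        # 1-NN rule: copy the label of a nearby neighbor

for _ in range(trials):
    # The "biased coin": each flip yields Adelie with probability 0.75.
    true_label = random.random() < p       # True means Adelie
    neighbor_label = random.random() < p   # an independent nearby sample
    if not true_label:
        bayes_errors += 1                  # guessed Adelie, was Gentoo
    if true_label != neighbor_label:
        nn_errors += 1                     # neighbor's label disagreed

print(round(bayes_errors / trials, 3))  # ~0.25, the Bayes risk
print(round(nn_errors / trials, 3))     # ~0.375 = 2 * p * (1 - p)
```

The simulated 1-NN error sits between the Bayes risk (the lower bound) and the upper bound Hart and Cover established, which is the relationship the plot illustrates.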

The k-NN algorithm can do no better than the Bayes optimal classifier (the dashed line, the lower bound). For 1-NN and large samples of data (n), the algorithm’s upper bound of the risk of being wrong is shown as the solid parabolic line: It can do no worse. But as k increases, and k/n remains small, the classifier’s performance starts approaching that of the Bayes optimal classifier: The parabola flattens and gets even closer to the dashed line.

Again, the intuition for it goes something like this: As you collect more and more samples, the immediate neighborhood of the new data point gets dense with data points representing Adélie and Gentoo penguins. For our example, the local region around the new data point will have 75 percent Adélie penguins and 25 percent Gentoo penguins. If the total number of samples gets very large, then the probability that a majority of the k nearest neighbors are Adélie penguins will approach one. In this idealized scenario, the k-NN algorithm (for large values of k) will always classify the new data point as an Adélie—as would the Bayes optimal classifier.

All this work, from start to finish, took about three months. Hart had started talking with his advisor, Tom Cover, in the spring of 1964. By late spring, Cover was heading to Cambridge for a summer stint at MIT. “We already had these results,” Hart said. “Tom, even as a brand-new advisor, was very well aware of the fact that you’re not supposed to give a Stanford Ph.D. on the basis of three months’ worth of dissertation work, no matter how fabulous the results. So, he looks at me and he says, ‘Can you extend these results?’ ”

Hart had already proven the results for almost all types of probability distributions: continuous ones, distributions that had discontinuities (or breaks), and those with infinitely many discontinuities. Cover asked Hart to extend the math to all measurable functions (where the functions represented probability distributions). “Now we get into the realm of measure theory. Nobody who’s not a mathematician has ever heard of it. You don’t want to go there. This is like a pure mathematician’s generalization. [There’s this] famous quote, ‘Here’s to pure mathematics—may it never be of any use to anybody,’ ” Hart said. “I kept a straight face and nodded. Tom took off for Cambridge.”

Hart came home to the “love of his life,” who had started off as a math major but then switched to history and would become a well-published author. He told her about Cover’s suggestion and added, “I have no idea how to even begin.” The couple took the summer off and learned how to sail at the Stanford sailing club, at times running their fifteen-foot sloop into some trees in the middle of a lake. The summer ended, and Cover came back and asked Hart about his progress. “I very truthfully replied, ‘I didn’t make any progress at all.’ Tom just nodded and said, ‘Yeah, I didn’t think you would,’ ” Hart said.

Hart did some more work to optimize the algorithm. It was more than enough for a dissertation: all of sixty-five pages, triple-spaced, typed up on an IBM Selectric typewriter. He got his Ph.D. in 1966. “I was barely twenty-five, and I was launched,” he said.

The k-NN algorithm has been extraordinarily successful, both because of and despite its simplicity. From the perspective of someone writing a piece of code, here’s some pseudo-code to illustrate how little it takes. (Let’s stick with Adélie and Gentoo penguins for now, so two classes of penguins, and two features, bill depth and bill length; the algorithm can be easily generalized to include more than two classes and any number of features.)

  • Step 1. Store all instances of sample data.
    • Each penguin is a vector [x1, x2], where x1=bill depth and x2=bill length. The entire dataset is stored in a matrix X, where X has m rows (number of penguins) and n columns (number of features).
    • Each penguin is also associated with a label y, which is equal to -1 (Adélie) or 1 (Gentoo). So, y, which stores all the corresponding labels, is an m-dimensional vector.
  • Step 2. Given a new data point, representing an unclassified penguin, in the form of a vector x with elements for bill depth and bill length [x1, x2], do the following:
    • Calculate the distance of the new data point to each of the data points in the original dataset, X. This gives us a list of m distances, d.
    • Sort the list d in the order of increasing distance (the first element has the minimum distance to the new point, and the last element has the maximum).
    • As you sort d, simultaneously rearrange the elements of y, so that the appropriate labels (-1 or 1) remain associated with each penguin in the sorted list.
  • Step 3. Take the first k elements of the sorted list d. These represent the k nearest neighbors. Collect the labels (-1 or 1) associated with each of these nearest neighbors. Count the number of 1s and the number of -1s.
  • Step 4. If the 1s outnumber -1s, the new data point has the label 1 (Gentoo); else it’s an Adélie.
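The four steps above fit comfortably into a short Python function. The penguin measurements below are invented for illustration, and `knn_classify` is a hypothetical name, not code from the text:

```python
import math

def knn_classify(X, y, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbors.

    X     : list of [bill_depth, bill_length] vectors (Step 1)
    y     : list of labels, -1 (Adelie) or 1 (Gentoo)
    x_new : the unclassified penguin, [bill_depth, bill_length]
    k     : number of neighbors (odd, for two classes)
    """
    # Step 2: Euclidean distance from x_new to every stored penguin.
    distances = [math.dist(x_new, x) for x in X]
    # Sort the labels by distance (sorting d and y together).
    sorted_labels = [label for _, label in sorted(zip(distances, y))]
    # Step 3: tally the labels of the k nearest neighbors.
    vote = sum(sorted_labels[:k])
    # Step 4: a positive sum means the 1s outnumber the -1s.
    return 1 if vote > 0 else -1

# Made-up measurements (bill depth mm, bill length mm):
X = [[18.3, 38.9], [18.7, 39.2], [19.0, 40.1],   # Adelie-like
     [15.0, 47.5], [14.3, 48.7], [15.7, 46.1]]   # Gentoo-like
y = [-1, -1, -1, 1, 1, 1]
print(knn_classify(X, y, [18.5, 39.0], k=3))  # -1 (Adelie)
print(knn_classify(X, y, [15.1, 47.0], k=3))  # 1 (Gentoo)
```

Note that there is no training step to speak of: Step 1 simply stores the data, and all the work happens at classification time.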

What can such a simple algorithm achieve? How about all the stuff that you get asked to buy on the internet? If companies, which shall not be named, want to recommend that you buy certain books or watch certain movies, they can do this by representing you as a vector in some high-dimensional space (in terms of your taste in books or movies), finding your nearest neighbors, seeing what they like, and recommending those books or movies to you. Even fruit flies are thought to use some form of a k-NN algorithm to react to odors: When a fly senses some odor, it makes the connection between the new odor and another odor most like it, for which it already has the neural mechanisms to respond behaviorally.

Perhaps the most important feature of the k-NN algorithm is that it’s a so-called nonparametric model. Cast your mind back to the perceptron. Once you have a trained model, using some initial training dataset, the perceptron is simply characterized by its weight vector, w. The number of elements of this vector equals the number of parameters that define the perceptron. This number is not dependent on the amount of training data. You could train the perceptron with one hundred instances of data or a million, but at the end of the training session, the hyperplane would still be defined by w.

A nonparametric model, by contrast, has no fixed number of parameters. The k-NN model is an example. You basically store all the instances of the training data (whether one hundred or one million) and use all the instances when doing the inference about a new piece of data. And therein, unfortunately, lies one of the biggest concerns with the k-NN algorithm. As the size of datasets has exploded, making inferences using k-NN requires increasing amounts of computational power and memory, to the point where it can become mind-numbingly slow and resource-intensive.

There’s one more glaring disadvantage, brought about by the strange behavior of data in very high dimensions. Our 3D minds are incredibly ill-equipped to appreciate all that can go wrong.

THE CURSE OF DIMENSIONALITY

In the preface to his 1957 book, Dynamic Programming, the American applied mathematician Richard Bellman wrote about the problems of dealing with extremely high-dimensional data and coined the phrase “the curse of dimensionality.”

There are many ways to appreciate this curse. Let’s start with something rather simple. Take a sample of data that can be described using one feature: something that takes a value between 0 and 2. Let’s assume that the random variable representing the value of the data sample is uniformly distributed, meaning that the random variable can take any value between 0 and 2 with equal probability.

If we were to sample twenty times from this distribution, we might get a sampling that looked something like the plot shown above, with an almost equal number of data points between 0 and 1 as there are between 1 and 2. (Note that the y-axis is irrelevant in this plot; all the points lie on the x-axis.)

Now imagine that the object you are describing requires two features, each with a value that can be between 0 and 2. Again, you sample 20 points from a uniform distribution, but now this distribution is over the 2D xy plane: The probability of drawing a sample from anywhere in this 2x2 square is the same. This is what you might get:

Note that the region of space for which the feature values lie between 0 and 1 forms only a quarter of the total space. (In the 1D space, it formed half the length of the total line segment.) So, now you are likely to find far fewer data samples in this region of space, which is the unit square. (In the example plot above, only 4 out of 20 end up in our prescribed region.)

Let’s move on to three dimensions. Now there are three features, and each can have a value anywhere between 0 and 2. Again, we are interested in the volume of space for which the features lie between 0 and 1; the volume is the unit cube. In the 3D case, the volume of interest is now an eighth of the total volume. So, if you drew 20 samples of data that were uniformly distributed over the total volume, which is a cube with sides 2 units long, then the number you’d find in the unit cube would be drastically reduced. (In the example shown below, only 2 are black dots; the gray dots lie outside.)

ML algorithms can be thought of as methods that are often examining such volumes of space in order to see, for example, whether one data point is like another. And these data are usually randomly sampled from some underlying distribution. In our toy example with features that have values between 0 and 2, when the number of dimensions (or the number of features that are needed to describe a data point) rises, say, to 1,000 or 10,000 or more, the chance of finding a data point within a unit hypercube rapidly diminishes. (The length of each side of a unit hypercube is equal to 1.) That’s assuming, of course, that we are sticking to 20 randomly drawn samples scattered over this extremely high-dimensional space; in this case, a unit hypercube in the hyperdimensional space may be devoid of any data. As Julie Delon of the Université Paris–Descartes says in her talks on the subject, “In high dimensional spaces, nobody can hear you scream.”
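The vanishing act is easy to simulate. A point drawn uniformly from the cube [0, 2]^d lands in the unit hypercube with probability (1/2)^d, so the fraction of samples found there collapses as the dimension grows; the sketch below (with made-up sample counts) checks this:

```python
import random

random.seed(1)

def frac_in_unit_hypercube(d, n_samples=100_000):
    """Fraction of points, uniform over [0, 2]^d, landing in [0, 1]^d."""
    hits = 0
    for _ in range(n_samples):
        # A point is inside the unit hypercube only if every coordinate is <= 1.
        if all(random.uniform(0, 2) <= 1 for _ in range(d)):
            hits += 1
    return hits / n_samples

for d in [1, 2, 3, 10, 20]:
    # The observed fraction tracks (1/2)**d: 0.5, 0.25, 0.125, ...
    print(d, frac_in_unit_hypercube(d))
```

By d = 20, a given sample has about a one-in-a-million chance of landing in the unit hypercube, so twenty samples will almost surely leave it empty.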

One way to mitigate the problem would be to increase the number of data samples (so that somebody can hear you scream in your immediate neighborhood). Unfortunately, for the solution to work, this number must grow exponentially with the number of dimensions, and so, the method runs out of steam very quickly. The curse takes over.

In his lectures, Thomas Strohmer, professor of mathematics at the University of California, Davis, describes another way to appreciate the curse of dimensionality. Let’s take the k-NN algorithm: It works by calculating distances between a new data point and each sample in the training dataset. The assumption is that similar points have smaller distances between them than those that are dissimilar. But something very curious happens to distances between data points in high-dimensional space. It has to do with the behavior of the volumes of objects such as hyperspheres and hypercubes.

Let’s start with a 2D circle of radius one, also called a unit circle. Take a smattering of 20 data samples distributed uniformly over the area circumscribed by the circle. Intuition tells us that the samples are spread evenly over the entire area, and our intuition is correct. The same intuition holds true for a unit sphere with a radius of one. We can visualize the volume of the sphere and the data points distributed uniformly inside that volume. However, our imagination and intuition falter when we move to higher dimensions. To see why, consider the volume of a unit sphere in some higher dimension, d. The volume is given by this formula:

V_d = π^(d/2) / Γ(d/2 + 1)

For whole numbers:

Γ(n) = (n-1)! = (n-1) × (n-2) × … × 2 × 1

The Gamma function is also defined for real and complex numbers.

We don’t need to worry about the specifics of the rest of this formula. As Strohmer points out, all we need to know is that the term in the denominator increases much, much faster than the term in the numerator, for increasing values of the dimension, d. This has extraordinary consequences. As the number of dimensions tends to infinity, the volume of the unit sphere tends to zero! However, the volume of a unit hypercube is always 1 regardless of the dimensionality of the hypercube.
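Python’s math.gamma makes the collapse easy to verify numerically; this quick sketch tabulates the unit-ball volume π^(d/2) / Γ(d/2 + 1) for growing d:

```python
import math

def unit_ball_volume(d):
    """Volume of the unit sphere in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

for d in [2, 3, 5, 10, 20, 100]:
    print(d, unit_ball_volume(d))
# The volume peaks around d = 5 and then plummets toward zero,
# while the volume of the unit hypercube stays at 1 in every dimension.
```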

This allows us to compare what happens to the volume of a sphere that’s embedded, or inscribed, within a cube, such that the sphere touches each of the cube’s surfaces. (This problem is elegantly analyzed by Alon Amit, in a detailed post on Quora.) Again, let’s start with dimensions that are familiar. Here are the two volumes in 3D space:

Let the cube and the sphere be centered at the origin, and the surfaces of the cube, at the point where they touch the sphere, be a unit length away from the origin. But the vertices of the cube are farther away.

For example, the vertex at (1, 1, 1) is at a Euclidean distance of √(1² + 1² + 1²) = √3 ≈ 1.73 from the origin, which is defined by (0, 0, 0).

So are the 7 other vertices.

Similarly, in 4D space, the cube has 16 vertices. The vertex at (1, 1, 1, 1) is at a distance of √(1² + 1² + 1² + 1²) = √4 = 2 units from the origin.

So are the 15 other vertices.

Similarly, for a 5D hypercube, the 2^5 (= 32) vertices are each 2.23 units away; and for a 10D hypercube, the 2^10 (= 1,024) vertices are each 3.16 units away. This gets out of hand very quickly. A 1,000-dimensional hypercube has 2^1000 (≈ 1.07 × 10^301) vertices, which is way, way more than the number of atoms in the observable universe, and each of these vertices is 31.6 units away from the origin.

Despite the enormous number of vertices and their increasing distance from the origin, the surfaces of the hypercube, at the point where they touch the unit hypersphere, are still only 1 unit away from the origin. What does this say about how much of the volume of the hypercube is occupied by the enclosed hypersphere and how much of it by the volume that extends to the vertices?

We know that in 3D space a large fraction of the volume of the cube is taken up by the enclosed sphere. This fraction starts decreasing as we move up in dimensions. We saw that as the number of dimensions tends to infinity, the volume of the unit hypersphere tends to zero. This means that the internal volume of the hypercube taken up by the unit hypersphere vanishes, most of the volume of the hypercube ends up near the vertices, and all the vertices are equally far away from each other.

What’s all this got to do with the k-NN algorithm and machine learning? Well, let’s say that data points that we are interested in are embedded in the volume of a hyperdimensional cube. They are vectors, or points, in this space. As the dimensionality grows, these points don’t occupy the internal volume of the unit sphere, because that’s going to zero; there’s no volume there to be occupied. The data points end up populating the corners of the hypercube. But given the extraordinarily large number of corners, most corners are devoid of data points, and the data points that do lie in some corners end up being almost equidistant from all other points, whether they belong to the same class or not. The whole idea of measuring distances to determine similarity falls apart. The k-NN algorithm’s central premise, that nearby points are similar, doesn’t hold water anymore. The algorithm works best for low-dimensional data.
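The collapse of “nearness” can also be simulated directly. For uniformly random points in the d-dimensional unit cube, the ratio of the farthest to the nearest distance from a query point sinks toward 1 as d grows, which is exactly what undermines k-NN; the sketch below uses made-up sample sizes:

```python
import math
import random

random.seed(2)

def distance_spread(d, n_points=500):
    """Farthest-to-nearest distance ratio from one query point to
    uniformly random points in the d-dimensional unit cube."""
    query = [random.random() for _ in range(d)]
    dists = [math.dist(query, [random.random() for _ in range(d)])
             for _ in range(n_points)]
    return max(dists) / min(dists)

for d in [2, 10, 100, 1000]:
    # The ratio sinks toward 1: every point ends up roughly as far
    # away as every other, and "nearest" stops meaning much.
    print(d, round(distance_spread(d), 2))
```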

Given this curse, machine learning sometimes turns to a powerful technique that has long been the staple of statisticians: principal component analysis (PCA). Often, it turns out that though the data may be very high-dimensional, much of the variation in the data that is needed to distinguish clusters lies in some lower-dimensional space. PCA is a powerful technique to reduce the data to some tractable number of lower dimensions, allowing an ML algorithm to do its magic.

After having introduced the curse of dimensionality, Bellman writes in Dynamic Programming, “Since this is a curse which has hung over the head of the physicist and astronomer for many a year, there is no need to feel discouraged about the possibility of obtaining significant results despite it.” Far from being discouraged, PCA reveals the awesome power of dimensionality reduction. That’s our next stop.

CHAPTER 6

There’s Magic in Them Matrices

When Emery Brown was doing his medical residency to become an anesthesiologist, one of his attending doctors said to him, “Now, watch this.” All of a sudden, the patient in whom they were inducing anesthesia fell unconscious. It was a profound moment. Now, after decades of practice, Brown—a professor of anesthesia at Harvard Medical School’s Massachusetts General Hospital, a computational neuroscientist at MIT, and a trained statistician and applied mathematician—still finds the transition from consciousness to unconsciousness in his patients “amazing.” Except, these days, he’s the attending telling his residents to watch not just the patients’ physiological patterns (such as changes in breathing), but also the EEG signals being recorded from their brains.

Most anesthesiologists don’t pay much attention to EEG signals as a means of monitoring a patient’s state of consciousness. Brown and his colleagues want to change that. They want ML algorithms to help anesthesiologists determine the dosage of anesthetics, either to induce unconsciousness or to bring patients out of it. And that means listening to the brain. Part of being able to do that involves collecting high-dimensional EEG data. (The dimensionality is determined by how much data there is and the number of features in each instance of the data, which, in this case, depends on the number of electrodes being used, the different frequencies at which the signals are being analyzed, and the duration of the recordings.) But working with high-dimensional data can drain computational resources. In one study done by Brown’s team, each person’s data, from just one electrode, yielded 100 different frequency components per time interval and 5,400 two-second time intervals (for a total of three hours of data). That’s a 100 x 5400 matrix, or 540,000 data points, for one electrode per person. The question the researchers were asking of this data was this: If one looks at the power in the EEG signal in each of the 100 frequency bands in any given time interval, can one tell whether a person is conscious or unconscious?

One way to make the problem tractable is to borrow a tool from the statistician’s toolbox: a simple, elegant, and long-standing method called principal component analysis (PCA), which involves projecting high-dimensional data onto a much smaller number of axes to find the dimensions along which the data vary the most. The trick lies in finding the correct set of low-dimensional axes. First, we need to get an intuition for PCA, one of the most important methods in all of data science and machine learning.

BABY PCA

Here’s a very simple example of some made-up data to illustrate the basic idea behind principal component analysis.

Consider the first plot. Each data point, shown as a circle or a triangle, is characterized by two values, or features, x1 (plotted along the x-axis) and x2 (plotted along the y-axis). As depicted, there is as much variation in the data along the x-axis as there is along the y-axis. Our job now is to reduce the dimensionality of this data, from two to one, such that most of the variation in the data is captured along one dimension. Can we draw a line representing one axis, or dimension, and project the data onto that axis?

This example was chosen because it is particularly easy and intuitive. There’s a relatively obvious answer. We can draw a dashed line that runs at an angle of 45 degrees and make that the x-axis.

Now imagine the dashed line as being the new x-axis. Draw another line at 90 degrees to it, making that the y-axis. Here’s what the data look like in the transformed coordinate space:

Now you can clearly see that there’s more variation in the data along the x-axis than along the y-axis. It’s also clear that if we now project the data onto the new x-axis, we’ll see a clear separation between the circles and the triangles, but if you were to project the data onto the y-axis, the points representing the two shapes would be bunched together. So, if we had to pick one axis, or dimension, to analyze the data, we’d pick the new x-axis. If we project the data onto that axis, we’d get this plot:

We’ve just performed a principal component analysis. We took a two-dimensional dataset; found a one-dimensional component, or axis, that captured most of the variation in the data; and projected the data onto that one-dimensional axis. You might wonder about the rationale for doing so. Well, for one, what we did makes it easy to find (and, in this case, visualize) the separation in the data—the circles lie to the left, and the triangles to the right. It’s easy for a classification algorithm, such as the perceptron algorithm, to find that boundary. (In this 1D case, the boundary is any point that separates the circles from the triangles.) The algorithm need work only in 1D space. Once it has found that boundary, then given a new data point of unknown type—we don’t know if it’s a circle or a triangle—we can just project it onto the single “principal component” axis and see if it falls to the right or the left of the boundary and classify it accordingly. This, of course, assumes that the new data point is drawn from the same distribution of data on which we performed PCA.
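For readers who want to see the projection step concretely, here is a minimal Python sketch. The specific coordinates are made up for illustration (an assumption, not the data in the figures): two clusters separated along the 45-degree direction, projected onto a single axis.

```python
import numpy as np

# Made-up 2D points: a "circles" cluster and a "triangles" cluster,
# separated along the 45-degree direction (assumed values for illustration).
circles = np.array([[1.0, 1.2], [1.5, 1.4], [2.0, 2.1]])
triangles = np.array([[4.0, 4.2], [4.5, 4.3], [5.0, 5.1]])
data = np.vstack([circles, triangles])

# The new x-axis: a unit vector at 45 degrees.
axis = np.array([1.0, 1.0]) / np.sqrt(2.0)

# Projecting each point onto that axis reduces the 2D data to 1D.
projected = data @ axis

# On the 1D axis the circles land to the left of the triangles,
# so a single boundary point separates the two classes.
print(projected)
print(projected[:3].max() < projected[3:].min())  # True
```

A perceptron-style classifier would then only need to find one threshold along `projected`, exactly as described above.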

Granted, this was a trivial example. We didn’t gain much, computationally speaking, by reducing dimensions from two to one to separate the circles from the triangles. But if the original dataset had been high-dimensional (as it was in Brown’s anesthesia study), then reducing the number of dimensions to some smaller number, such that those lower dimensions, or principal components, capture most of the variation in the data, would be a huge computational advantage. There’s always a risk that the dimensions along which there isn’t much variation, which we are throwing away, are important. Also, we are assuming that the dimensions along which there’s a lot of variation have a high predictive value. Such risks notwithstanding, we’d be projecting data from tens or hundreds of thousands of dimensions (or even more) down to a handful and then using computationally tractable algorithms to find patterns in the data.

To make more formal sense of all this, we begin with an introduction to the German word Eigen, the first mention of which (in the context that interests us) appears in a 1912 work by the German mathematician extraordinaire David Hilbert (1862–1943), titled Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen (Fundamentals of a General Theory of Linear Integral Equations). Eigen means “characteristic, peculiar, intrinsic, or inherent.” Hilbert used Eigenfunktionen, for “eigenfunctions,” and Eigenwerte, for “eigenvalues.” Our focus will be on two uses of “eigen”: eigenvalues and eigenvectors. They will lay the groundwork for our understanding of principal component analysis.

EIGENVALUES AND EIGENVECTORS

Much of machine learning, as you well know by now, comes down to manipulating vectors and matrices. A vector is simply a set of numbers arranged as a row or a column. The number of elements in the vector is its dimensionality.

[3 4 5 9 0 1] is a row vector; stack the same six numbers vertically, and you have the equivalent column vector.

These are two ways of representing the same vector. Its dimensionality, in this case, is 6. If you were to think of a 6-dimensional coordinate space (say, with axes x, y, z, p, q, r), then this vector would be a point in that 6D space. It’s, of course, challenging for us to visualize anything in higher than three spatial dimensions. But conceptually, this is not a difficult idea: Just like the vector [3 4 5] is a point in 3D space (3 units along the x-axis, 4 units along the y-axis, and 5 units along the z-axis), [3 4 5 9 0 1] is a point in 6D space, with the requisite number of units along each axis.

We also know from our earlier discussion about vectors that sometimes it’s worth thinking of a vector as having a direction: it’s the orientation of the line drawn from the origin to the point in n-dimensional space. But for the purposes of machine learning, it’s best to get used to thinking of vectors simply as sequences of numbers, or as matrices with one row or one column.

To recap, a matrix is a rectangular array of numbers. Generically, an m × n matrix has m rows and n columns. Here’s an example of a 3×3 matrix, with the rows running horizontally and the columns vertically:

    [ a11  a12  a13 ]
    [ a21  a22  a23 ]
    [ a31  a32  a33 ]

So, a12 is an element of the matrix that belongs to the first row, second column, and a32 is an element that belongs to the third row, second column. More generically, an m × n matrix is given by:

    [ a11  a12  ...  a1n ]
    [ a21  a22  ...  a2n ]
    [  .    .          . ]
    [ am1  am2  ...  amn ]

and aij is an element of row i and column j.

Earlier, we saw some of the basic operations you can do with vectors. The one operation that’s of concern to us now is the multiplication of a vector by a matrix. Consider:

Ax = y, where A is a matrix and x and y are vectors

Taking an example, and expanding the elements, we get:

    [ a11  a12  a13 ]   [ x1 ]   [ a11·x1 + a12·x2 + a13·x3 ]   [ y1 ]
    [ a21  a22  a23 ] · [ x2 ] = [ a21·x1 + a22·x2 + a23·x3 ] = [ y2 ]
                        [ x3 ]

如果矩阵Am行(在上面的例子中m= 2 )和n列(在我们的例子中,=3),那么向量x必须是具有n 个元素(或行)的列向量,或维数为n (=3,上文)。如果仔细观察,矩阵向量乘法涉及对矩阵的每一行与列向量进行点积。您将矩阵的每一行视为行向量。这就是为什么矩阵A的列数必须等于列向量x的行数或维数。

If the matrix A has m rows (m=2, in the above example) and n columns (n=3, in our example), then the vector x has to be a column vector with n elements (or rows), or a dimensionality of n (n=3, above). If you look carefully, a matrix-vector multiplication involves taking the dot product of each row of the matrix with the column vector. You are treating each row of the matrix as a row vector. That’s why the number of columns of the matrix A must equal the number of rows, or dimensionality, of the column vector x.

In the above example, the output vector y has a dimensionality of 2. This output dimensionality depends entirely on the number of rows in the matrix A. If matrix A has 4 rows, the output vector y would have a dimensionality of 4.

Now, each vector is a point in some n-dimensional space. In our example above, the input vector x is a point in 3D space, but the output vector is a point in 2D space (or in 4D space, for the 4-row matrix). Multiplying a vector by a matrix can transform the vector, by changing not just its magnitude and orientation, but the very dimensionality of the space it inhabits.
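A quick NumPy sketch makes this shape bookkeeping explicit. The matrix entries below are arbitrary placeholders, not values from the text:

```python
import numpy as np

# A hypothetical 2x3 matrix A and a 3D vector x (arbitrary values).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
x = np.array([1.0, 0.0, 2.0])

# Each entry of y is the dot product of one row of A with x,
# so a 2x3 matrix maps a 3D vector to a 2D vector.
y = A @ x
print(y.shape)        # (2,)

# With 4 rows, the same input vector lands in 4D space.
A4 = np.vstack([A, A])   # a 4x3 matrix
print((A4 @ x).shape)    # (4,)
```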

If you want to preserve the dimensionality of the vector during a vector-matrix multiplication, what should be the shape of the matrix? A moment’s thought, and it’s clear that the matrix must be square. For the dot products to work, the number of the matrix’s columns should be the same as the dimensionality of the vector x. The number of its rows should also be the same as the dimensionality of the vector x, to get an output vector y with the same number of elements. It’s this operation—the multiplication of a vector with a square matrix—that gets us to the concept of eigenvectors and eigenvalues.

Such a multiplication simply moves a vector from one location in its coordinate space to another point in the same space or moves a point from one location to another. In general, moving a point—say, in 2D space—from one location to another will change both the magnitude and the orientation of the vector.

For example, the vector [1, 2] multiplied by the matrix

    [ 1   0 ]
    [ 1  -2 ]

results in the vector [1, -3]. The graph above shows the transformation.

In the figure, the old vector is the arrow pointing to (1, 2), and the new vector is pointing to (1, -3). Note that the new vector has changed in magnitude and orientation. For almost all the vectors that are in this 2D plane, multiplying by a 2x2 matrix (let’s restrict ourselves to a square matrix with real numbers as its elements) will result in a similar transformation: a change in magnitude and orientation.

But there are some orientations associated with each square matrix that are special, or characteristic (hence, the notion of “eigen”). There are algebraic ways of figuring out what these directions are for a given matrix. For our purposes, let’s take it as given that there are such methods. In fact, if you are programming in Python, there’s code that will do this for you, or you could just type the matrix values into a Web interface provided by Wolfram Alpha, and it’d spit out the values for you. For our example matrix, with rows [1, 0] and [1, -2], here are those orientations:

Depicted in the figure are two vectors of length 1, or so-called unit vectors. These represent special orientations for our example matrix. If you take a vector that lies along one of these orientations and multiply it by the example matrix, you’ll get a new vector that has the same orientation but may have a different length. In other words, the new vector is simply rescaled, or multiplied by some scalar value. Note that you can multiply the old vector by a negative scalar value and get a vector that is flipped over, relative to the origin, but is still considered to have the same orientation, because it lies on the same line.

These special orientations, or vectors, associated with a square matrix are called eigenvectors. Corresponding to each eigenvector is an eigenvalue. So, if you take a vector with the same orientation as one of the eigenvectors and multiply it by the matrix, you’ll get a new vector that is scaled by the corresponding eigenvalue. Continuing with our example, the two eigenvalues for our example matrix are -2 and 1, and the corresponding eigenvectors are [0, 1] and [3, 1]. Again, you can find these eigenvectors and eigenvalues using either algebraic methods or code, or simply by asking Wolfram Alpha.

So, let’s take a vector that lies along the y-axis, say, [0, 2], and multiply it by our matrix. We get:

    [ 1   0 ]   [ 0 ]   [  0 ]
    [ 1  -2 ] · [ 2 ] = [ -4 ]

The new vector is a scaled version of the original vector (by -2, the eigenvalue), but it lies along the same line as before (see figure, below). So, -2 is an eigenvalue, and the corresponding eigenvector is [0, 1].

More generally, eigenvectors and eigenvalues are written in this way:

Ax = λx, where A is a matrix and x is an eigenvector and λ is an eigenvalue. That’s saying that multiplying the vector x by the matrix A results in a vector that equals x multiplied by a scalar value λ.
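NumPy’s `linalg.eig` does this algebra for us. The matrix below is the example matrix as reconstructed from the numbers in the text (it sends (1, 2) to (1, -3) and has eigenvalues -2 and 1); treat the exact entries as an assumption:

```python
import numpy as np

# The example matrix, reconstructed from the text (an assumption):
# it sends (1, 2) to (1, -3) and has eigenvalues -2 and 1.
A = np.array([[1.0,  0.0],
              [1.0, -2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns are eigenvectors
print(np.sort(eigenvalues))                    # [-2.  1.]

# Verify Ax = λx for each eigenvector/eigenvalue pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

# A vector along an eigenvector is only rescaled: [0, 2] -> [0, -4].
print(A @ np.array([0.0, 2.0]))
```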

For a 2×2 matrix, there are at most two eigenvectors and two eigenvalues. The eigenvalues may or may not be distinct.

In his Stanford lectures, Anand Avati shows a neat way to visualize what a matrix does to vectors and the connection of that transformation to eigenvectors and eigenvalues. Let’s start with a set of unit vectors, arranged such that their tips (the points) form the circumference of a circle of unit radius. Multiply each of those vectors with a square matrix. When you plot the transformed vectors, you get an ellipse. The matrix has squished and stretched the circle into an ellipse.

Again, for our example matrix, you get the plot shown above. Notice the two black vectors. Those are the directions of the eigenvectors, one of which scales a unit vector by -2 and the other by 1, as dictated by the corresponding eigenvalues.

Now we come to a very special type of matrix: a square symmetric matrix (with real values, for our purposes; no imaginary numbers, please).

One such matrix would be one with the numbers 3 and 2 on its diagonal and the same value in both of its off-diagonal positions.

Note that the matrix is symmetric about the diagonal that goes from the top left (number 3) to the bottom right (number 2). (We’ll use such square symmetric matrices when we come to principal component analysis.) Here’s what this matrix would do to a set of unit vectors that formed a circle:

There’s something quite elegant about the transformation. Each original input unit vector is transformed into an “output” vector, and the output vectors taken together form an ellipse. It turns out that the eigenvectors lie along the major and minor axes of the ellipse. These eigenvectors are perpendicular to each other; they are called orthogonal eigenvectors. Note that the eigenvectors won’t be orthogonal when the matrix is not square symmetric (as in the previous example).
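We can check this orthogonality numerically. The book’s exact symmetric matrix isn’t reproduced here, so the sketch below assumes a hypothetical one with 3 and 2 on the diagonal:

```python
import numpy as np

# A hypothetical square symmetric matrix (3 and 2 on the diagonal,
# equal off-diagonal entries); the book's exact values are assumed here.
S = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Unit vectors whose tips trace a circle of radius 1...
theta = np.linspace(0.0, 2.0 * np.pi, 200)
circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, 200)

# ...are squished and stretched into an ellipse.
ellipse = S @ circle

# eigh handles symmetric matrices; its eigenvectors are orthogonal
# and lie along the ellipse's major and minor axes.
eigenvalues, eigenvectors = np.linalg.eigh(S)
v1, v2 = eigenvectors.T
print(np.isclose(v1 @ v2, 0.0))   # True: orthogonal

# The longest transformed vector has length equal to the largest eigenvalue.
lengths = np.linalg.norm(ellipse, axis=0)
print(np.isclose(lengths.max(), eigenvalues.max(), atol=1e-3))  # True
```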

We have been looking at 2x2 matrices operating on 2D vectors. But everything we have discussed holds true for a space of any dimension. Say you were working in 10,000 dimensions. Then a set of unit vectors in 10,000-dimensional space (the equivalent of a circle in 2D space) would be transformed by a square, symmetric matrix (which has 10,000 rows and 10,000 columns) into the equivalent of an ellipsoid in 10,000-dimensional space.

In the 2D case, the square symmetric matrix with real-valued elements has two eigenvectors and two corresponding eigenvalues. In the 10,000-dimensional case, the matrix has 10,000 eigenvectors and 10,000 eigenvalues, and those 10,000 eigenvectors would be orthogonal to each other. It’s pointless attempting to visualize it.

COVARIANCE MATRIX

We have one more crucial concept to get through before we can tackle a real dataset with principal component analysis. Let’s start with a simple 3×2 matrix.

It’s a tiny dataset that represents three data points: (h1, w1), (h2, w2), and (h3, w3). Let’s say that each row of the matrix represents a person, with the first column of the row being the person’s height and the second column their weight. If we plot the height on the x-axis and the weight on the y-axis, we’ll have three points, each representing one person.

For argument’s sake, consider a plot of 500 such persons. The data would be represented by a 500×2 matrix (500 rows, 2 columns). If all the people in the dataset were men of similar phenotype and genotype, from the same geographical region and ethnicity, who ate the same kind of food and had much the same behaviors regarding exercise, you’d expect their heights and weights to be related. As the height increases, so does the weight. But what if the data included people who were heavier or skinnier? Now you’d see many people with the same height but with wildly different weights. What if you added women to the mix? The relationship between the height and weight would again likely change somewhat.

There’s a way in which this information can be captured in a matrix. Let’s go back to our small 3×2 matrix X, whose rows are (h1, w1), (h2, w2), and (h3, w3).

First, from each element of the matrix, we subtract the expected value, or the mean, of that feature. So, for the first column, the heights have a mean of E(h); and similarly, the mean for the weights is E(w). When we subtract these mean values from their respective elements in the matrix, this is what we get:

    [ h1 - E(h)   w1 - E(w) ]
    [ h2 - E(h)   w2 - E(w) ]
    [ h3 - E(h)   w3 - E(w) ]

This procedure of setting each element to its mean-corrected value is also called centering. The reasons for doing this are a bit involved, so let’s take it as something that is done implicitly. For the calculations that follow, assume that the matrix X is mean-corrected.

Now we take the dot product of the transpose of X with itself. We saw earlier that the transpose of a column vector turns it into a row vector, and vice versa. The transpose of a matrix, Xᵀ, similarly flips the rows and columns. So,

    XᵀX = [ h1² + h2² + h3²            h1·w1 + h2·w2 + h3·w3 ]
          [ h1·w1 + h2·w2 + h3·w3      w1² + w2² + w3²       ]

(Note: this is a mean-corrected matrix)

This matrix, the result of taking the dot product of a 2×3 matrix with a 3×2 matrix, has two rows and two columns, and hence, is a square matrix.

Take a close look at the values of each element of the matrix. The first element (row 1, column 1) is simply the sum of the squares of the heights of the three people in our original dataset. Or, rather, it’s the sum of the squares of mean-corrected heights.

h1² + h2² + h3² = (h1 - E(h))² + (h2 - E(h))² + (h3 - E(h))²

As we saw in chapter 4, if height were a random variable, then this sum is also the variance of h. Similarly, the diagonal element (row 2, column 2) is the sum of the squares of the three mean-corrected weights and, hence, is the variance of w.

w1² + w2² + w3² = (w1 - E(w))² + (w2 - E(w))² + (w3 - E(w))²

So, the diagonal terms of the XᵀX matrix are the variances of the individual features: The larger these values are, the more variance, or spread, there is in the dataset for that aspect of people.

The off-diagonal elements tell us something even more interesting. First, both off-diagonal elements are the same. This is a square symmetric matrix. (Keep this in mind, for we’ll come back to why it is important.) The off-diagonal element is the sum of the products of the mean-corrected height and weight of each person and gives us what’s called the covariance between pairs of random variables.

Let’s take a small example. Here are the heights (in feet) and weights (in pounds) of three people, in matrix form:

    [ 5   120 ]
    [ 6   160 ]
    [ 7   220 ]

Mean of heights is: (5 + 6 + 7)/3 = 6

Mean of weights is: (120 + 160 + 220)/3 = 166.67

The mean-corrected matrix is:

    [ -1   -46.67 ]
    [  0    -6.67 ]
    [  1    53.33 ]

Now let’s suppose that the three people had somewhat different weights, for the same heights. The first two persons, who are 5 feet and 6 feet tall, respectively, are heavier, and the 7-foot-tall person is severely underweight.

After mean correction:

Note that the off-diagonal values are lower (-40) than in the previous case (100). These off-diagonal values are telling us that the heights and weights in the first case are more related to each other (an increase in height is associated with an increase in weight) than the heights and weights in the second case (where one person’s weight drops dramatically despite the increase in height).

The upshot of all this is that the diagonal elements capture the variance, or spread, in the values of individual features of a dataset, whereas the off-diagonal elements capture the covariance between the features. In our example, it’s the covariance of the height and weight. In theory, you can have any number of features in the dataset (height, weight, cholesterol level, diabetes status, etc.). If so, the matrix we’ve just calculated—also called the covariance matrix—will get bigger and bigger, and each off-diagonal element will capture the covariance of a different pair of features. But it’ll always remain square and symmetric.
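Here is the height-and-weight calculation from the example above, carried out in NumPy:

```python
import numpy as np

# Heights (feet) and weights (pounds) of the three people from the text.
X = np.array([[5.0, 120.0],
              [6.0, 160.0],
              [7.0, 220.0]])

# Center each column (subtract the feature's mean).
Xc = X - X.mean(axis=0)

# The covariance matrix: diagonal entries are the variances (sums of
# squares of mean-corrected values), off-diagonal entries are the
# covariance between height and weight.
C = Xc.T @ Xc
print(C)

# The off-diagonal entries both equal 100, as in the text,
# and C is square and symmetric.
print(C[0, 1], C[1, 0])
```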

All this analysis has been leading up to this statement: The eigenvectors of a covariance matrix are the principal components of the original matrix X. Explaining exactly why requires far more analysis, but here’s an intuition that might help: The covariance matrix describes how the dimensions relate to one another, and the eigenvectors of the covariance matrix yield the primary dimensions along which the original data vary. It’s challenging, however, to arrive at the intuition, so we’ll leave that aside; rather, let’s focus on how to make use of the statement.

You start with, say, an m×2 matrix X, with m rows and 2 columns, where m is the number of individuals and 2 is the number of features. Calculate the mean-corrected covariance matrix XᵀX. This will be a 2×2 square, symmetric covariance matrix. Find its eigenvectors and eigenvalues. Then, for each eigenvector of the covariance matrix, the associated eigenvalue tells you how much variance there is in the data along the direction of the eigenvector. For example, you might discover—once you calculate the two eigenvectors and their eigenvalues—that almost all the variation in the original data lies in the direction of one of the eigenvectors (the major axis of the ellipse). You can ignore the other direction, for it tells you little. A 2D problem has been reduced to a 1D problem. All you must do now is project the original data onto the axis represented by that one eigenvector.
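The whole recipe fits in a few lines of NumPy. The dataset below is synthetic (an assumption), built so that almost all of its variance lies along one direction:

```python
import numpy as np

# Synthetic m x 2 data: the second feature is almost a multiple of the
# first, so nearly all variance lies along one direction (assumed values).
rng = np.random.default_rng(0)
m = 500
t = rng.normal(size=m)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=m)])

# Step 1: mean-correct.
Xc = X - X.mean(axis=0)

# Step 2: the 2x2 square, symmetric covariance matrix.
C = Xc.T @ Xc

# Step 3: eigenvectors and eigenvalues (eigh returns them in
# ascending order of eigenvalue, and works on symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Step 4: almost all the variance lies along the top eigenvector,
# so project onto that single axis: the 2D problem becomes 1D.
top = eigenvectors[:, -1]
projected = Xc @ top
print(eigenvalues[-1] / eigenvalues.sum())  # fraction of variance captured
print(projected.shape)                      # (500,)
```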

Again, reducing 2D to 1D is trivial and usually unnecessary. But if the data has hundreds of features (with each feature describing, say, some aspect of an individual), then finding the handful of eigenvectors of the covariance matrix or the principal components of the original dataset makes our task immensely easier, in terms of understanding the patterns hiding in the data.

It’s worthwhile examining another example problem, which John Abel, a postdoc on Brown’s team, often uses to highlight ways in which PCA may be useful. Let’s say we have a dataset of vehicles that are categorized based on six features, such as the height, length, number of wheels, number of passengers, size, and shape. Each feature corresponds to a dimension along which the vehicle is being analyzed. Most of the variation in this dataset will likely lie along the dimensions that map onto the size and shape of vehicles. If you did principal component analysis on this dataset, the first principal component would capture most of this variation. And if your intent was to use the variation in size and shape to classify the vehicles, then the first principal component would be extremely useful. But what if the vehicles had one other feature: say, number of ladders? The only type of vehicle that would have ladders would be a fire truck. All other vehicle types would have zero ladders. So, in the original dataset, there would be very little variation along this dimension. If you did PCA, and looked only at the first principal component, it would not be informative about the number of ladders. If your task were to classify vehicles as fire trucks, then finding the first principal component and discarding the rest (particularly, information about the number of ladders) would likely make it impossible to tell which vehicle was a fire truck and which one wasn’t. As Kenny Rogers sang, “You’ve got to know when to hold ’em and know when to fold ’em.”

THE IRIS DATASET

Many books and lectures on machine learning invariably mention the Iris dataset. It is data about, as the name suggests, irises. The data were first formally published in a 1936 paper titled “The Use of Multiple Measurements in Taxonomic Problems,” by Ronald Aylmer Fisher, a British biologist, mathematician, statistician, and geneticist, among other things. Fisher was also a staunch eugenicist. It is little wonder that the paper, published originally in the Annals of Eugenics and now made available online by the journal Annals of Human Genetics, comes with this disclaimer: “The work of eugenicists was often pervaded by prejudice against racial, ethnic and disabled groups. Publication of this material online is for scholarly research purposes [and] is not an endorsement or promotion of the views expressed in any of these articles or eugenics in general.”

The Iris dataset is a bit of a marvel as a didactic tool for modern machine learning. Fisher used it to illustrate some statistical techniques. But he wasn’t the creator of the data. They were collected, painstakingly, by an American botanist named Edgar Anderson, who, in his paper titled “The Irises of the Gaspe Peninsula,” poetically recalled how he went about gathering them. Anderson wrote about the abundance of these flowers on a stretch from L’Isle-Verte to Trois-Pistoles on the peninsular coast in Quebec, Canada: “There for mile after mile one could gather irises at will and assemble for comparison one hundred full-blown flowers of Iris versicolor and of Iris setosa canadensis, each from a different plant, but all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus. The result is, to ordinary eyes, a few pages of singularly dry statistics, but to the biomathematician a juicy morsel quite worth looking ten years to find.”

Anderson’s data, as collated in Fisher’s paper, were of three types of irises: Iris setosa, Iris versicolor, and Iris virginica. For each flower, Anderson measured four characteristics, or features: the sepal length, the sepal width, the petal length, and the petal width. The sepal is the green, leaflike protection around a flower bud that opens out underneath a blossoming flower. There are 50 entries per type of flower. All those data are captured in a 150x4 matrix (150 rows, one row for each flower, 4 columns, one column per feature of the flower). There’s also a 150-dimensional vector (or a fifth column of the matrix) that tells you the type of the flower. We’ll leave this information aside for now.
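
The matrix and label column described above can be sketched in a few lines of Python. Using scikit-learn’s bundled copy of the dataset is an assumption of convenience; the numbers could equally be typed in from Fisher’s 1936 paper.

```python
# Assemble the Iris data described above, via scikit-learn's bundled copy.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data        # the 150x4 matrix: one row per flower, one column per feature
labels = iris.target # the fifth column: flower type encoded as 0, 1, or 2

print(X.shape)       # (150, 4)
print(labels.shape)  # (150,)
```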

Here’s our problem: Can we visually discern any structure or pattern in this dataset? We cannot plot the data, because there are four features, and hence, each flower exists as a vector in 4D space. So, we cannot visually tell anything about the axis or axes along which you find the most variance among the flowers. What if we find the two main principal components and project the data down to 2D space? Then we can plot the data and see if any pattern emerges.

We start with the data matrix X, which contains the information about 150 flowers. Assume X is mean-corrected.

The covariance matrix is XᵀX.

Because X is a (150×4) matrix and Xᵀ is a (4×150) matrix, the covariance matrix is the dot product of a (4×150) matrix with a (150×4) matrix. So, it’s a (4×4) matrix. Given that the covariance matrix is square and symmetric with real values, it has 4 orthogonal eigenvectors. So, each eigenvector is a 4D row or column vector. Put the four of them together, and you get another (4×4) matrix. Call it W.
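
A minimal sketch of this step in Python (an assumption; the book itself shows no code), using random stand-in data with the same shape as the Iris matrix:

```python
# Mean-correct a 150x4 data matrix and eigendecompose its covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))       # stand-in for the 150x4 Iris matrix
X = X - X.mean(axis=0)              # mean correction, column by column

C = X.T @ X                         # (4x150) . (150x4) -> the (4x4) covariance matrix
eigenvalues, W = np.linalg.eigh(C)  # symmetric real matrix: orthogonal eigenvectors

print(C.shape)                          # (4, 4)
print(np.allclose(W.T @ W, np.eye(4)))  # True: the four eigenvectors are orthonormal
```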

We must ensure that W is ordered: The first column is the eigenvector with the largest eigenvalue, the second column with the next biggest, and so on. The first eigenvector is the direction in which the data have the most variance; the variance associated with the next eigenvector is a little less, and so on.

We’ll take the first two eigenvectors. These are our two main principal components. We are taking two so that we can plot and easily visualize the data; within the confines of an ML algorithm, you can work with any number of lower dimensions. Arrange the two eigenvectors side by side, and we get a 4×2 matrix; call it Wr (for W-reduced).
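
Continuing the sketch (stand-in data again): NumPy’s eigendecomposition happens to return eigenvalues in ascending order, so ordering W and slicing out Wr takes one extra step.

```python
# Order the eigenvectors by descending eigenvalue and keep the top two as Wr.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
X = X - X.mean(axis=0)
eigenvalues, W = np.linalg.eigh(X.T @ X)

order = np.argsort(eigenvalues)[::-1]   # largest eigenvalue first
W = W[:, order]                         # columns now sorted by descending variance
Wr = W[:, :2]                           # keep the two main principal components

print(Wr.shape)  # (4, 2)
```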

We now must project the original dataset X onto these two axes. The original dataset had 4 columns, or features. The transformed dataset, call it T, will have two columns, or features. Taking the dot product of X and Wr gives us T.

T = X.Wr

X is a (150×4) matrix.

Wr is a (4×2) matrix.

So, T is a (150×2) matrix.
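
Put together on the real measurements (scikit-learn’s copy of the dataset assumed, as before), the projection works out exactly as the shapes above promise:

```python
# Mean-correct, eigendecompose the covariance matrix, keep two components, project.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                   # the 150x4 Iris matrix
X = X - X.mean(axis=0)                 # mean correction
vals, W = np.linalg.eigh(X.T @ X)      # eigenvectors of the (4x4) covariance matrix
Wr = W[:, np.argsort(vals)[::-1][:2]]  # top two principal components, (4x2)

T = X @ Wr                             # (150x4) . (4x2)
print(T.shape)  # (150, 2)
```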

Our 150 flowers have now been reduced from 4D vectors to 2D vectors. In the original dataset, each feature, or dimension, had a meaning: sepal length or petal width, for example. In the 2D space, however, the two features have no direct physical meaning; instead, each of them encapsulates something about how much each of the original dimensions contributes to it.

Now let’s see what happens when all the 150 flowers are plotted on a 2D plane, where the x-axis is the first principal component and the y-axis is the second principal component. You get the figure shown below. It’s kind of hard to tell anything except that one group of points is well separated from another, larger group.

It’s somewhat informative, but we can do more. It’s time to look at the fifth column of data that we have kept aside. The fifth column had associated each row in the original 150x4 matrix with a flower type. What if we plot the same figure that we just did but give a different shape and color (gray circle, gray square, or black triangle) to each data point, depending on whether it’s Iris setosa, Iris versicolor, or Iris virginica? Something magical happens. The flowers clearly cluster in the 2D plot:

We have just seen the power of principal component analysis. By reducing the dimensionality of the data from four to two, we can visualize the data. And it so happened that the two main principal components captured almost all the variance in the data—there are ways to figure out exactly how much—and the variance was such that it allowed us to clearly see the pattern in 2D. We got lucky. However, there can be high-dimensional data that shows considerable variance along most of, if not all, its principal components. For example, in the 2D case, when the unit circle was transformed into an ellipse, what if the major and minor axes of the ellipse were almost equally long? In that case, both axes would contain equal amounts of variation. In such a situation, there’s little benefit to doing this analysis. You may as well stick to your original data, as there’s no way to effectively reduce the dimensionality and not lose valuable information.
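
One way to figure out exactly how much variance is captured: each eigenvalue measures the variance along its eigenvector, so the fraction captured by the top two components is just a ratio of eigenvalue sums. A sketch (scikit-learn’s Iris copy assumed):

```python
# Fraction of total variance captured by the two main principal components.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X = X - X.mean(axis=0)
vals = np.linalg.eigvalsh(X.T @ X)[::-1]  # eigenvalues, largest first

fraction = vals[:2].sum() / vals.sum()    # variance captured by the top two components
print(round(fraction, 3))                 # about 0.98 for the Iris data
```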

There’s another way of thinking about what we’ve just done. We projected our data into a lower-dimensional, more computationally tractable space and then labeled the various flowers as being of one type or another. Here’s where machine learning comes in. If we are now given a new data point, but it’s missing the label for the type of flower, what can we do? Well, first we project that data point onto the same two principal components and plot it. We can see where it lands and tell, just by eyeballing it, the type of flower it is. Or we could use some of the algorithms we saw earlier, such as the nearest neighbor algorithm, to classify the new data point.
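
The last idea can be sketched directly: project a new, unlabeled flower with the same Wr and take the label of its nearest neighbor in the 2D plot. The new flower’s measurements below are hypothetical, chosen to be setosa-like.

```python
# Classify a new flower by 1-nearest neighbor in the 2D projected space.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, labels = iris.data, iris.target
mean = X.mean(axis=0)
vals, W = np.linalg.eigh((X - mean).T @ (X - mean))
Wr = W[:, np.argsort(vals)[::-1][:2]]
T = (X - mean) @ Wr                              # all 150 flowers, projected to 2D

new_flower = np.array([5.0, 3.4, 1.5, 0.2])      # hypothetical setosa-like measurements
t = (new_flower - mean) @ Wr                     # project with the same two components
nearest = np.argmin(((T - t) ** 2).sum(axis=1))  # index of the nearest neighbor in 2D
print(iris.target_names[labels[nearest]])        # setosa
```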

But what if we didn’t have the original labels? Let’s say Anderson, despite his finicky data collection habits, forgot to write down the type of flower alongside each row describing the four features of the flower. What’s an ML engineer to do?

Well, there’s a whole field called unsupervised learning that tries to find patterns or structure in unlabeled data. One analysis method that might be considered a precursor to unsupervised learning is clustering, an intuitive example of which is called the K-means clustering algorithm. It needs to be told how many clusters there are in the data. Given that information, the algorithm iteratively tries to find the geometric center of each cluster. Once it finds those “centroids,” it assigns the appropriate label (in this case, 0, 1, or 2) to each data point, depending on its closest centroid. We can run this algorithm on the low-dimensional Iris dataset, sans the labels. The algorithm finds three centroids, shown as black dots.
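
A sketch of that procedure on the 2D-projected Iris data, using scikit-learn’s KMeans (an assumption; any K-means implementation would do), told that there are three clusters:

```python
# K-means on the 2D-projected Iris data, with the number of clusters given.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
X = X - X.mean(axis=0)
vals, W = np.linalg.eigh(X.T @ X)
T = X @ W[:, np.argsort(vals)[::-1][:2]]  # project to 2D as before

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(T)
print(km.cluster_centers_.shape)  # (3, 2): the three centroids, the "black dots"
print(np.bincount(km.labels_))    # roughly fifty points per cluster
```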

You can see that despite our not knowing the flower types, the combination of principal component analysis and K-means brings us close to identifying distinct clusters in the dataset. Though, if you look closely, the clusters identified in this manner don’t exactly match up with the original clusters, but they are close. Also, of course, in this case we wouldn’t know what each cluster meant—is it Iris setosa, Iris versicolor, or Iris virginica? Nonetheless, the ability to find clusters in data, particularly high-dimensional data, is invaluable.

We now have the tools to tackle the problem we opened the chapter with—Emery Brown and his team’s work with people under anesthesia.

CONSCIOUSNESS AND ANESTHESIA

Principal component analysis could one day help deliver the correct dose of an anesthetic while we lie on a surgeon’s table. At least, Brown and his colleagues would like to add PCA-enabled machine learning to an anesthesiologist’s tool kit. And their study demonstrated in no small measure some of the steps needed to get there.

First, the data. The team collected what has to be one of the cleanest datasets of EEG signals of people being put under an anesthetic. Ten subjects were each administered the anesthetic propofol over a period of about 2.5 hours. The propofol was gradually increased until the estimated concentration of the anesthetic in the blood at a given location in the body went from 0 to 5 micrograms per milliliter and then back down to zero. Every two seconds, the subject was asked to respond to an auditory command by pressing an appropriate button. The response was used to assess their state of consciousness. Meanwhile, the researchers recorded EEG signals using sixty-four scalp electrodes. “It’s a very rich dataset in the sense that it [was collected] in such a tightly controlled environment,” team member John Abel told me. Such data would have been near impossible to gather from patients undergoing surgery in an OR. “It’s challenging to collect EEG in the OR. Recording EEG is low on the list of priorities when you’re actually doing surgery.”

For the PCA part of their analysis, the team looked at the EEG signals recorded at just one location on the prefrontal cortex. Using the data collected from that one electrode, they calculated the power spectral density: the power in the signal as a function of frequency. So, for every two-second time interval, this resulted in a vector of a hundred dimensions, where each element of the vector contained the power in the signal in some frequency band. The overall frequency ranged from 0 to 50 Hz, with each band corresponding to a hundredth of that range.

If the subject is monitored for three hours, say, then the monitoring generates a 100-dimensional vector every two seconds, for a total of 5,400 such vectors. Once the data acquisition and processing are complete, a subject’s EEG is encapsulated in a (5400×100) matrix, S. Each row of the matrix represents the power spectral density for a two-second interval; each column represents the power spectral density in each frequency bin.
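
Here is a sketch of how one two-second window might become one 100-dimensional row of S. The sampling rate (250 Hz) and the stand-in signal (a pure 10 Hz oscillation) are assumptions; the team’s exact preprocessing may differ.

```python
# Turn one two-second window of signal into a 100-bin power spectral density vector.
import numpy as np

fs = 250                             # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs           # one two-second window
window = np.sin(2 * np.pi * 10 * t)  # stand-in EEG: a pure 10 Hz oscillation

power = np.abs(np.fft.rfft(window)) ** 2        # power at each FFT frequency
freqs = np.fft.rfftfreq(window.size, d=1 / fs)

# Bin 0-50 Hz into 100 bands, each a hundredth of the range (0.5 Hz wide)
edges = np.linspace(0, 50, 101)
row = np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                for lo, hi in zip(edges[:-1], edges[1:])])

print(row.shape)       # (100,): one row of the (5400x100) matrix S
print(np.argmax(row))  # 20: the band starting at 10 Hz carries the power
```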

S is an (m×n) matrix, where m = 5,400 (rows) and n = 100 (columns).

The other data that’s generated apart from the time series of the power spectral density is the inferred state of the subject: conscious or not. Every two seconds, you get a 1 if the subject is inferred to be conscious, 0 if not. That’s another 5,400-dimensional vector c, one entry for each 2-second time slot.

c is an m-dimensional vector, where m = 5,400.

The team collected such data for ten patients. We are now ready to do PCA. Here’s one way to go about it.

Let’s take matrices for only seven of the ten subjects. (We’ll leave three aside for testing, which we’ll come to in a bit.) First, stack the matrices of all seven, one below the other, so that we get a (37,800×100) matrix. The reason for doing this? It increases the amount of information you have in each of the 100 columns. Each column now contains the power spectral density for not just one subject, but seven.
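
The stacking step is a one-liner in NumPy (random stand-in matrices here in place of the real spectral data):

```python
# Stack seven subjects' (5400x100) spectral matrices one below the other.
import numpy as np

subjects = [np.random.default_rng(i).normal(size=(5400, 100)) for i in range(7)]
X = np.vstack(subjects)

print(X.shape)  # (37800, 100)
```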

This giant (37,800×100) matrix is our matrix X. Perform mean correction.

The covariance matrix is XᵀX, which is a (100×37,800) matrix dotted with a (37,800×100) matrix, giving us a (100×100) matrix. This has 100 eigenvectors and 100 eigenvalues. We’ll take the first three eigenvectors (associated with the three biggest eigenvalues), and we get a (100×3) matrix Wr.

These three eigenvectors are our first three principal components. It turns out, said Abel, that the first eigenvector is not very informative with respect to the state of consciousness. While it does capture the maximum amount of variation in the data along that axis, that variation doesn’t reveal much about whether a patient is conscious or unconscious. (This is the kind of poking around that a data scientist must do to extract information from the data.) So, we drop the first principal component and just use the next two. So, Wr now is a (100×2) matrix.
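
Dropping the first component just changes which columns we slice out of the ordered eigenvector matrix. A sketch on stand-in data of the right shape:

```python
# Keep the second and third principal components of a (37800x100) matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(37800, 100))       # stand-in for the stacked EEG matrix
X = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(X.T @ X)
vecs = vecs[:, np.argsort(vals)[::-1]]  # order by descending eigenvalue

Wr = vecs[:, 1:3]                       # drop PC1; keep the second and third
print(Wr.shape)  # (100, 2)
```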

Now we can project any subject’s high-dimensional data onto these two principal components, or axes. This involves taking the dot product of the matrix for one subject, a (5400×100) matrix, with Wr, a (100×2) matrix. The result: a (5400×2) matrix. Each row of the matrix represents the state of consciousness of the patient projected down from a hundred dimensions (of power spectral density data) to two. There are 5,400 such states, one for each two-second time slot. If you plot these states on the xy plane, using gray circles for “conscious” and black triangles for “unconscious” (recall that we have this data in the form of a 5,400-dimensional vector for each subject), here’s what you get, as an example:

This is pretty amazing. The conscious and unconscious states needn’t have separated. But they do, albeit not very cleanly—you can see some black triangles among the gray circles, and vice versa—and this is where machine learning enters the picture. Given such data in two dimensions, one can build an efficient classifier that finds the boundary that best separates the gray circles from the black triangles. A linear classifier that finds a good-enough straight line to separate the two clusters will do—“good” here implies the best possible. It’s clear that there’s no line that can be drawn that will classify all the gray circles as lying on one side and all the black triangles as lying on the other: There’s some overlap in the data, and so, there will be some mistakes. The task is to minimize the mistakes. The perceptron algorithm, for instance, will never find a solution, because a linearly separating hyperplane doesn’t exist in this case. But a naïve Bayes classifier will find a solution, and of course, so will a k-nearest neighbor algorithm. (In the final chapter, we’ll address this rather profound issue of choosing a simple model versus a more complex model—a topic that is typically referred to as the bias–variance problem, and the perils and promises of preferring one over the other.)
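
A sketch of that situation with two overlapping 2D clusters and a Gaussian naive Bayes classifier (scikit-learn assumed; the cluster locations are made up). A perceptron would never converge on these data, but the naive Bayes classifier settles on a good-enough boundary:

```python
# Fit a naive Bayes classifier to two overlapping clusters of 2D points.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
conscious = rng.normal([0.0, 0.0], 1.0, (200, 2))    # "gray circles"
unconscious = rng.normal([2.0, 2.0], 1.0, (200, 2))  # "black triangles"
points = np.vstack([conscious, unconscious])
labels = np.array([1] * 200 + [0] * 200)

clf = GaussianNB().fit(points, labels)
score = clf.score(points, labels)
print(score)  # well above chance but below 1.0: the clusters overlap
```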

Once you have trained a classifier, you can test it. This is where those three subjects we kept aside come in. We can pretend we don’t know the state of consciousness of the subject in any given two-second time slot. All we have to do is project the 100-dimensional vector that captures that state down to two dimensions (the two principal axes used above), and see what the classifier says it should be: gray (conscious) or black (unconscious). But we also have actual data about the state of the subject for that two-second time slot, the so-called ground truth. We can compare the prediction against the ground truth and see how well the classifier generalizes to data it hasn’t seen. The aim of this entire effort is to ensure that one builds a classifier that minimizes prediction error. However, as was mentioned in chapter 2, minimizing prediction error is no simple matter. The overall objective matters, and the nuances depend on the exact problem being tackled. But let’s say that the prediction error was minimized keeping the right objective in mind. Only then can something like this be introduced into a real-world setting, involving a patient undergoing surgery and a machine that’s recommending the dosage of the anesthetic to an anesthesiologist, who factors it into their decision-making process. While there’s considerably more research and engineering required to build such a machine, predicting the state of consciousness of a patient by using EEG data would be central to such an effort. And principal component analysis might play a role.
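
Testing on held-out data follows the same pattern. In this sketch a second, freshly generated batch of points stands in for the three held-out subjects (made-up cluster locations, as before):

```python
# Train on one batch, then score predictions against ground truth on unseen data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)

def make_batch(n=300):
    # Two overlapping clusters standing in for conscious/unconscious states
    x = np.vstack([rng.normal([0.0, 0.0], 1.0, (n, 2)),
                   rng.normal([2.0, 2.0], 1.0, (n, 2))])
    y = np.array([1] * n + [0] * n)
    return x, y

X_train, y_train = make_batch()       # stands in for the seven training subjects
X_test, y_test = make_batch()         # stands in for the three held-out subjects

clf = GaussianNB().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # prediction vs. ground truth on unseen data
print(accuracy)
```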

So far, we have encountered situations where high-dimensional data posed problems. PCA showed us one way to find a lower-dimensional space in which to make sense of data. But sometimes, data in low dimensions can be problematic. For example, what if all you had was lower-dimensional data that could not be linearly separated, but you wanted to use a linear classifier, because it works so well? It’d be impossible to do so in the lower-dimensional space. Well, you could do the opposite of what PCA does and project the data into higher dimensions, sometimes even into an infinite-dimensional space, where there always exists some linearly separating hyperplane. An algorithm that used this trick, the subject of the next chapter, rocked the machine learning community in the 1990s.

CHAPTER 7

The Great Kernel Rope Trick

Bernhard Boser was biding his time at AT&T Bell Labs in Holmdel, New Jersey. It was the fall of 1991. He had been offered a position at the University of California, Berkeley, but there were still three months to his start date. At Bell Labs, he had been a member of the technical staff, working on hardware implementations of artificial neural networks. But in those intervening three months, he didn’t want to start a new hardware project. Looking to keep busy, he started talking to one of his colleagues at Bell Labs, Vladimir Vapnik, an eminent Russian mathematician, a formidable expert in statistics and machine learning, and a recent immigrant to the United States. Vapnik asked Boser to work on an algorithm that Vapnik had designed back in the 1960s and that now appeared in an addendum to the English-language translation of his seminal book, Estimation of Dependences Based on Empirical Data. The addendum was called “Remarks about Algorithms.” Vapnik wanted Boser to implement the algorithm detailed in Remark No. 4, “Methods for Constructing an Optimal Separating Hyperplane.”

A separating hyperplane, as we saw earlier, is the linear boundary between two regions of coordinate space: a line separating two regions of 2D space, or a plane separating two regions of 3D space, or a hyperplane dividing some higher-dimensional space into two. Such a separating hyperplane delineates data points into two clusters. Points that lie to one side of the hyperplane belong to one category, and those that lie on the other side belong to another. Given labeled data, the perceptron algorithm devised by Frank Rosenblatt can find such a hyperplane, if one exists. However, for a linearly separable dataset, there exists an infinity of separating hyperplanes. And some are better than others.

Below is an example of the perceptron algorithm finding a hyperplane that separates the circles from the triangles. It’s a perfectly valid hyperplane, given the initial set of data on which the perceptron is trained.

Now imagine you are given a new data point, a triangle that lies close to the original cluster of triangles. The perceptron must classify it based on the hyperplane it previously found. Here’s what it’ll do: It’ll classify the point as a circle (shown in gray). And it will be wrong.

It’s easy to imagine another hyperplane—shown as a dashed gray line, say, one that’s rotated a few degrees—that would correctly classify the new point as a triangle.

Of course, even with the new hyperplane, it’s possible that a new data point gets misclassified. And while one can eyeball a 2D plot of points to imagine yet another hyperplane that might do better, this is not a sustainable method. Also, let’s not forget that this process of finding a hyperplane often involves operating in dimensions that are far greater than two, making visualization impossible. What’s needed is a systematic method for finding the best possible separating hyperplane that would minimize errors when classifying new data points. Vapnik’s method did just that: It found an optimal hyperplane from an infinity of choices.

The first figure on the opposite page visualizes Vapnik’s algorithm.

Given some linearly separable sets of data points, the algorithm finds a hyperplane (the black line) that maximizes the margins on either side of it. Note that some data points are black, while the others are gray. The black ones are nearest to the separating hyperplane. In the example, there’s one black circle and two black triangles. The separating hyperplane is equidistant from the black circle and the two black triangles. It’s as if we cleared a path through the thicket of data points, a “no-one’s-land,” if you will. By definition, no data points lie inside the path. The closest ones from either cluster come at most to the edges, or margins, of the path, and those are the ones in black. The hyperplane is the line that goes through the middle of the path.

Once you find such a hyperplane, it’s more likely to correctly classify a new data point as being a circle or a triangle than the hyperplane found by the perceptron. Boser implemented and tested the algorithm in no time at all. It worked.
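
Modern libraries package a descendant of that algorithm. Here is a sketch with scikit-learn’s linear SVC on made-up, linearly separable clusters; a large C pushes it toward the hard-margin behavior described here:

```python
# Fit a maximum-margin linear classifier to two separable clusters.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
circles = rng.normal([-2.0, -2.0], 0.5, (50, 2))
triangles = rng.normal([2.0, 2.0], 0.5, (50, 2))
X = np.vstack([circles, triangles])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C: (near) hard-margin behavior
print(clf.support_vectors_.shape[0])         # only a few points pin down the margins
print(clf.predict([[1.5, 1.8]]))             # a new point near the triangles -> [1]
```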

The math behind Vapnik’s algorithm is elegant and puts to use much of the terminology we have encountered so far. But it’s an involved analysis. (See the notes for a pointer to an excellent exposition of the math.) We’ll aim for an intuitive understanding.

The goal of the analysis is to find the hyperplane depicted below. Note that the figure now has a vector w. This is the weight vector characterizing the hyperplane and is perpendicular to it. The hyperplane is also characterized by a bias b, which is its offset from the origin. Also note this time that there are two black circles and one black triangle and that each serves as an instance of data closest to the hyperplane. This is deliberate, to make the point that the number of data points of each class that lie on the edges of the maximally wide path can be anything; it depends on your training dataset. The only guarantee is that there is bound to be, by definition, at least one of each category.

Let’s say that circles are labeled -1 and triangles +1. There are n data points (both circles and triangles). After some nifty vector algebra, Vapnik proved that the weight vector that maximizes the separation between points on either side of the hyperplane can be found by minimizing the function ½‖w‖², where ‖w‖ is the magnitude of the weight vector, while simultaneously satisfying this equation for every data point xi (a vector) and its associated label, yi (a scalar, equal to -1 or +1):

yi(w.xi + b) ≥ 1

w.xi is the dot product of the weight vector with the ith data point. The above equation is also called the margin rule; it ensures that points on either side of the hyperplane can get only so close and no closer, thereby creating the no-one’s-land.
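
The margin rule can be checked numerically on a trained maximum-margin classifier (made-up separable clusters, scikit-learn assumed): every point should satisfy yi(w.xi + b) ≥ 1, with the nearest points sitting right at 1.

```python
# Verify the margin rule yi(w.xi + b) >= 1 on a trained linear SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, (50, 2)),
               rng.normal([2.0, 2.0], 0.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # (near) hard-margin fit
w, b = clf.coef_[0], clf.intercept_[0]

margins = y * (X @ w + b)  # yi(w.xi + b), one value per data point
print(margins.min())       # about 1: the closest points lie exactly on the margin
```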

So, we have some function, given by ½‖w‖², which must be minimized. If it were simply a question of finding the minimum of a quadratic function (a polynomial of degree 2, which is what we have), that would be straightforward. The function is bowl-shaped, and we can use gradient descent to find the minimum. But minimizing ½‖w‖² while accounting for the second set of equations, yi(w.xi + b) ≥ 1, complicates things somewhat.

We now have a constrained optimization problem. We must descend the bowl to a location that simultaneously satisfies the constraint and is a minimum. One solution for such a problem was devised by Joseph-Louis Lagrange (1736–1813), an Italian mathematician and astronomer whose work had such elegance that William Rowan Hamilton—we met Hamilton in chapter 2; he was the one who etched an equation onto the stones of an Irish bridge—was moved to praise some of Lagrange’s work as “a kind of scientific poem.”

NOT JUST THE BOTTOM OF THE BOWL

Before we come to Lagrange’s work, here’s a fun, but entirely unreasonable, mental exercise to motivate the mathematical discussion to follow. Imagine you are clambering up the slope of a hill over a valley. You are a prospector and have been told that there is a vein of some rather exotic mineral beneath the hills that forms a circle centered on the valley floor and about a mile in radius. It’s rather far away to dig horizontally toward the vein from the bottom of the valley. But there’s another option. The surrounding hillsides go up very gently. You could walk up the slopes such that you were over the vein and, yet, be at a height that’s far less than a mile—say, a few hundred feet at most—and dig down from there. So, you walk up and reach a location that’s directly over the supposed vein of mineral underground. But there’s a problem. As you walk the hillside along a circular path around the valley, making sure you are always above the mineral vein, you are tracing a circle, but your altitude might go up and down, because the hillside is not necessarily level. Your job is to find the place on that circular path that has the least altitude, so that when you do drill down, it’ll require the minimum amount of digging.

What we have just done is pose a constrained optimization problem. If you had simply been told to find the place with the least altitude in the valley (the minimum), well, that would have been easy. You just walk down to the valley floor, doing your own version of gradient descent. But now you must find a minimum altitude (a value along the z, or vertical, direction) given a constraint, which is that you must be a certain horizontal distance away from the valley’s bottom. The x and y coordinates (measured along the plane of the valley floor) must lie on a circle with a radius of one mile, with the circle centered on the middle of the valley.

Here’s a depiction of the problem:

The dashed circle is the mineral vein. The solid wavy line is the path you walk along on the slopes, such that you are directly above the vein. Note that the altitude varies as you walk along the solid line, and you must find the spot where the altitude is at a minimum (there may be multiple such locations). Of course, if you were an actual prospector, you could just walk around with an altimeter and find the lowest point on that path and start drilling down. How would you do it mathematically?

The mathematical equation depicting the surface shown above is a bit gnarly:

f(x, y) = x² + sin⁴(xy) + xy

Given the x and y coordinates, the function calculates the height of the surface in the z direction. Let’s take a much simpler function to work with:

f(x, y) = xy + 30

Here’s how the function looks. You can see that it rises along two sides and drops off along two other sides. Such a surface has a saddle point, the flat bit in the middle, but it has no maximum or minimum.

Now think of our constrained optimization problem. Let’s add the constraint that the (x, y) coordinates must lie on a circle of radius 2. So, the (x, y) coordinates are constrained by the equation of a circle:

x² + y² = r² = 2² = 4

In the figure, the dashed circle lies on the xy plane. The solid circle is what you get when you satisfy the constraint as you move along the 3D surface. Even though the original 3D surface sans the constraints had no minimum or maximum, the constrained path along the surface does have a minimum and maximum.

Lagrange came up with an elegant solution to find the extrema (the minima and the maxima) of such a constrained path. To understand his solution, we need a few different ways of looking at the problem. For starters, here’s a depiction of the surface using contour lines (or the terraces of rice paddies on hillsides that we encountered earlier, which are paths along the hillsides that are at the same altitude, or height).

Because we must find the minimum and maximum along the solid circle, we are really interested in the smallest and the largest values of the contour lines, respectively, that touch the solid line. It’s clear that we can ignore contour lines that don’t intersect the constraining curve. They are of no interest, for they clearly don’t satisfy the constraint. The contour lines that either touch or intersect the curve do satisfy the constraint at certain points. Let’s think about them, while focusing on finding the minimum, first. To find a minimum, we need to go down a slope. As we do, we encounter different contour lines, including ones that intersect the constraining curve, one that just touches the constraining curve at a single point, and then contour lines that don’t intersect the curve. As we move from higher ground to lower ground, the values of these contour lines, or the heights they represent, keep decreasing. The contour line that grazes the constraining curve is the one of interest to us. It represents the minimum height while simultaneously satisfying the constraint. The same analysis holds for finding the maximum. In both cases, we are interested in contour lines that tangentially touch the constraining curve.

It’s easier to see this in 2D. Here are the contour lines projected down to the 2D xy plane (fewer lines are shown for clarity). Note that the constraining curve is a circle, as it should be.

The contour lines get darker as they decrease in value (the surface is falling) and fade away with increasing value (the surface is rising). In this figure, you can see four contour lines touching the constraining curve. We must find the values of those contour lines. They represent the extrema of our surface, given the constraint.

Lagrange noticed that the tangent to the constraining curve and the tangent to the contour line, at each extremum, are essentially the same line at the point of contact. If we draw an arrow that is perpendicular to each of those tangent lines, those two arrows will point in the same direction. But what does an arrow that’s perpendicular to the tangent to the contour line represent? We have seen it before. It’s the gradient of the surface, the direction of steepest ascent. So, what we are saying is that at the point where the tangents to the constraining curve and the contour line are parallel, or are in essence the same line, their gradients point in the same direction.

Recall that the gradient is a vector. The fact that two gradients point in the same direction doesn’t mean they have identical lengths, or values. They may differ in their magnitude: One could be a scalar multiple of the other.

For our example function, the gradient of the 3D surface is given by:

∇f(x, y) = [∂f/∂x, ∂f/∂y] = [y, x]; read "∇f(x, y)" as "del f of x, y"

Let’s call the constraining function g(x, y). So:

g(x, y) = x² + y² = 4

The gradient of the constraining function is:

∇g(x, y) = [2x, 2y]

Lagrange’s insight was:

∇f(x,y) = λ∇g(x,y)

The gradient of one function is a scalar multiple, λ, of the gradient of the other function.

We saw in chapter 3 that the gradient of a function that represents a surface in 3D space is a two-dimensional vector. The first element of the vector is the partial derivative of the function with respect to x, and the second element of the vector is the partial derivative with respect to y.

f(x, y) = xy + 30, and so ∇f(x, y) = [y, x]

Similarly,

g(x, y) = x² + y², and so ∇g(x, y) = [2x, 2y]

Now, according to Lagrange’s method:

∇f(x,y) = λ∇g(x,y)

Or, writing out the components:

[y, x] = λ[2x, 2y]

This gives us two equations:

y = 2λx and x = 2λy

But we have three unknowns, (x, y, λ), and only two equations. We need at least one more equation to solve for all the unknowns. That’s, of course, the constraining equation:

x² + y² = 4

Solving these three equations, we get values for the three unknowns. (Substituting x = 2λy into y = 2λx gives y = 4λ²y, so λ = ±1/2; the constraint then pins down x and y.)

We don’t really care about the value for lambda. That was just a placeholder, a trick to let us calculate the values for the (x, y) coordinates of the extrema.

Those coordinates are (√2, √2), (-√2, -√2), (√2, -√2), and (-√2, √2).

Plug these values into the equation representing the surface, and you get the values for the corresponding z coordinates: z = xy + 30 = 32 at (√2, √2) and (-√2, -√2), and z = 28 at (√2, -√2) and (-√2, √2).
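These values can be sanity-checked by brute force: parametrize the constraint circle as (2cos t, 2sin t) and scan f along it. A quick sketch (the grid resolution is an arbitrary choice):

```python
import math

# The surface f(x, y) = x*y + 30, constrained to the circle x^2 + y^2 = 4.
# Walking the circle as (2*cos(t), 2*sin(t)) and recording f at each step
# recovers the constrained minimum and maximum.
def f(x, y):
    return x * y + 30

n = 100000
vals = [f(2 * math.cos(2 * math.pi * i / n), 2 * math.sin(2 * math.pi * i / n))
        for i in range(n)]
print(round(min(vals), 6), round(max(vals), 6))  # 28.0 32.0
```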

There are two points that represent the constrained minima and two that represent the constrained maxima. This makes sense, because the surface is symmetric about the center, and the constraining curve is a circle. Here’s what the points look like in the 2D and 3D contour plots:

More generally, the problem of constrained optimization can be thought of as finding the extrema of the so-called Lagrange function, given by:

L(x, λ) = f(x) - λg(x)

The logic here is straightforward. First, let’s take the gradient of both sides of this equation.

∇L(x, λ) = ∇f(x) - λ∇g(x)

At the extrema, the gradient of L must be zero. If we set the left-hand side to zero, then we get back the equality that we analyzed above:

∇f(x) = λ∇g(x)

What this means is that the optimization problem boils down to finding the extrema of the Lagrange function. The example we looked at was rather easy, and we could find the extrema analytically. But in most real-world cases, the process is far more complicated. Also, we looked only at a constraint that was an equality. Constraints can also be inequalities (say, something must be greater than or equal to some number). Regardless, it’s this method, often called the method of Lagrange multipliers (λ being the Lagrange multiplier) that will help us move farther along in our discussion of finding the optimal separating hyperplane.
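For our worked example, we can check numerically that the gradient of the Lagrangian vanishes at an extremum. Solving y = 2λx and x = 2λy together with x² + y² = 4 gives λ = ±1/2 and |x| = |y| = √2; here is a quick check at (√2, √2) with λ = 1/2:

```python
import math

# L(x, y, λ) = f(x, y) - λ*g0(x, y), with f = x*y + 30 and g0 = x² + y² - 4.
# At a constrained extremum, all three partial derivatives of L vanish.
def grad_L(x, y, lam):
    dLdx = y - lam * 2 * x          # ∂L/∂x
    dLdy = x - lam * 2 * y          # ∂L/∂y
    dLdlam = -(x * x + y * y - 4)   # ∂L/∂λ (recovers the constraint)
    return dLdx, dLdy, dLdlam

g = grad_L(math.sqrt(2), math.sqrt(2), 0.5)
print(all(abs(c) < 1e-9 for c in g))  # True
```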

THE OPTIMAL MARGIN

Our purpose, before we segued into talking about Lagrange multipliers, was to find the weight vector w such that its magnitude, ‖w‖, was minimized. There was an additional constraint, which was that the following equation (the margin rule) had to be satisfied for data points that either lie on or beyond the margins of no-one’s-land:

yi(w.xi + b) ≥ 1

Well, we are in Lagrange territory.

You’d write down the Lagrange function by treating the quantity being minimized as f(x) and (yi(w.xi + b) - 1) as g(x) and then follow the process described above. However, the equations that you get are not always solvable analytically, and you need specialized mathematical techniques to find the Lagrange multipliers. (There are as many of these multipliers as there are constraining equations, and we have one such equation for each data point.)

We’ll focus on the results of the constrained optimization. The first result is that the weight vector turns out to be given by this formula:

w = Σi αi yi xi

Each αi (alpha sub-i) is a scalar and is specific to a given data point and its associated label, (xi, yi). These alphas are the Lagrange multipliers. (We are calling them alphas now instead of lambdas, to be consistent with much of the literature.) Given some training data, specialized optimization techniques can give us the alphas, and thus help us calculate the weight vector and the bias term. A further key insight that arises from the mathematical analysis is that the alphas depend only on the mutual dot product of the vectors representing the data samples. Keep this in mind.

The weight vector, as you can see from the equation above, turns out to be a linear combination of vectors that represent the data samples, with the coefficients of the combination being the alphas. This is an amazing result.

So, once we have the alphas, we can calculate the weight vector, which along with the bias, b, determines the hyperplane. Then it’s easy to figure out whether a new data point—let’s call it u—lies to one side of the hyperplane or the other.

The label for a new data point u = sign(w.u + b)

We can substitute the value for w from the above equation, to get the second result, the decision rule:

The label for a new data point u = sign(Σi αi yi (xi.u) + b)

Note that the decision rule similarly depends only on the dot product of the new sample with each of the vectors representing the training data. It turns out that αi will be zero for samples that don’t lie on the margins; so, effectively, we are dealing with only those data points that lie on the margins.

The plot above shows only the data that lie on the margins. These data points are, of course, vectors, and because they help define, or anchor, the margins, they are called support vectors. Also, all this analysis holds for vectors of any dimension, not just the 2D vectors we used for easy visualization.

If this rather involved discussion got us to appreciate one thing, it is this: The optimal separating hyperplane depends only on the dot products of the support vectors with each other; and the decision rule, which tells us whether a new data point u is classified as +1 or -1, depends only on the dot product of u with each support vector. So, if your dataset had 10,000 data points, but only ten of them were support vectors, you would need to worry about calculating only the mutual dot products of these ten vectors and the dot product of the new data point with each support vector. (This recap glosses over the elaborate optimization that’s required to find the αi for each support vector.)
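To make that takeaway concrete, here is a minimal sketch of the decision rule in Python. The support vectors, labels, alphas, and bias below are made-up placeholders (a real optimizer would produce them), but the classification logic, a sum of dot products with the support vectors, is the one described above:

```python
# SVM decision rule: label(u) = sign(Σ_i α_i * y_i * (x_i · u) + b).
# The support vectors, labels, alphas, and bias are illustrative placeholders.
def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

support_vectors = [(1.0, 1.0), (-1.0, -1.0)]  # hypothetical support vectors
labels = [1, -1]                              # their class labels
alphas = [0.5, 0.5]                           # hypothetical Lagrange multipliers
bias = 0.0

def classify(u):
    s = sum(a * y * dot(x, u)
            for a, y, x in zip(alphas, labels, support_vectors))
    return 1 if s + bias >= 0 else -1

print(classify((2.0, 3.0)))    # 1: falls on the +1 side of the hyperplane
print(classify((-2.0, -1.0)))  # -1: falls on the other side
```

Note that only dot products with the two support vectors are ever computed, no matter how many points were in the original training set.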

Regardless, it’s a rather extraordinary finding, and it falls out of the algorithm that Vapnik designed in 1964. It’s the algorithm that Boser worked on at Bell Labs in the fall of 1991. He finished implementing and testing it for linearly separable datasets, but he still had time to kill before moving to Berkeley. Vapnik suggested that Boser try classifying linearly inseparable datasets by projecting the data into higher dimensions. Boser began thinking about it. The idea is that even though the data is linearly inseparable in its original, low-dimensional space, one can project it into some higher dimension, where it could potentially be linearly separable, and one could use the optimal margin classifier to find the higher-dimensional hyperplane. If you projected the hyperplane back into the low-dimensional space, it’d look like some nonlinear curve separating the data into clusters.

There are many ways to project data into higher-dimensional spaces. For our purposes, such projections come with two major concerns. One has to do with Vapnik’s original algorithm, which requires taking mutual dot products of data samples. Let’s say the original dataset was in ten dimensions. That would require taking dot products of ten-dimensional vectors. If this data is linearly inseparable in 10D space, and if it were to be projected into 1,000 dimensions, where the data cleanly clumped into two separable categories, then each data point would be represented by a 1,000-dimensional vector. The algorithm requires dot products of these extremely large vectors. As you go into higher and higher dimensions, searching for that optimal space where you can find a linearly separating hyperplane, the computational costs of calculating dot products can get prohibitively high.

The other concern has to do with the fact that sometimes one wants to project data into a space that has infinite dimensions. (We’ll soon see how that’s possible.) This has enormous advantages, because in an infinite-dimensional space, you can always find a separating hyperplane. But it isn’t obvious how to compute dot products of vectors of infinite dimensions, let alone store such vectors in computer memory. How, then, do you find the hyperplane?

One morning, on their way to work together, Boser began discussing his project with his wife, Isabelle Guyon, an ML expert whose mind had a much more mathematical bent. She also worked at Bell Labs. Guyon had thought a lot about such problems, especially for her Ph.D. thesis. She immediately suggested a solution that would bypass the need to compute dot products in the higher-dimensional space. It involved a neat trick, one whose history goes back to work by other Russian mathematicians in the 1960s. Guyon’s insight, and her subsequent involvement in the project with Vapnik and Boser, led to one of the most successful ML algorithms ever invented.

THE KERNEL TRICK

In the early 1980s, Isabelle Guyon was a young engineering student in Paris, interested in cybernetics and looking for an internship. One of her professors, Gérard Dreyfus, who later became her dissertation advisor, told her to read a paper by a physicist named John Hopfield. The paper described a novel way of constructing neural networks that could be trained to store memories. These networks, which came to be called Hopfield networks (the subject of the next chapter) and were designed for storing memories, caused a flutter among neural network researchers. As an intern, Guyon began working on them and continued the research while doing her master’s degree. She developed a more efficient method for training Hopfield networks and tried to use those networks to classify images of handwritten digits. But the peculiar nature of the networks made them rather ineffective at such classification tasks. Guyon moved on to other algorithms for pattern recognition. She picked up the “bible” of the field at the time, a book on pattern classification by Richard Duda and Peter Hart—we met Hart in chapter 5, as the co-inventor of the Cover-Hart k-nearest neighbor algorithm—and began implementing and benchmarking various pattern recognition algorithms.

Two ideas that Guyon encountered while obtaining her doctorate directly connect to her later work at Bell Labs. One was the idea of optimal margin classifiers. Even as she built linear classifiers using Hopfield networks and other algorithms, Guyon became aware of work by two physicists, Werner Krauth and Marc Mézard, who were working nearby in Paris, at the École Normale Supérieure. Krauth and Mézard had published a paper in 1987 showing how to train a Hopfield network such that the network stored memories with minimum overlap. Their idea was, in a manner of speaking, an algorithm for finding an optimal margin to separate two regions of coordinate space. Their paper was published a year before Guyon defended her Ph.D. thesis, for which she had tested numerous algorithms for linear classification—but none of these was an optimal margin classifier, meaning the algorithms found some linear boundary, not necessarily the best one. Guyon could have used Krauth and Mézard’s algorithm to implement an optimal margin classifier; she didn’t. “One of the examiners of my Ph.D. asked me why I did not implement the algorithm of Mézard and Krauth and benchmark it against the other things I was trying. I said, ‘Well, I didn’t think it would make that much of a difference,’ ” Guyon told me. “But the reality is that I just wanted to graduate, and I didn’t have time.”

So, when Bernhard Boser told Guyon about Vapnik’s optimal margin classifier, which he, Boser, had implemented in the fall of 1991, it rang a bell. Vapnik’s request that Boser push data that was linearly inseparable in lower dimensions into some higher dimensions rang even louder bells. Guyon had come across such ideas while doing her Ph.D. One of the key papers she had studied on the subject was written in 1964, by three Russian researchers, M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, who had worked in the same institute as Vapnik, but independently of him.

In their paper, the Russian trio, inspired by ideas in physics, had developed an algorithm that allowed Rosenblatt’s perceptron to find nonlinear boundaries. We can cut to the chase and examine the essence of what they did, eschewing many of the contortions found in the early works. Recall that Rosenblatt’s perceptron algorithm works only when the data is linearly separable. Here’s a simple dataset for which it wouldn’t work:

In the figure on the left, there’s no way to draw a straight line, or a linearly separating hyperplane, to demarcate the circles from the triangles. But if we project these data into three dimensions in such a way that the triangles rise above the circles (the figure on the right), we can find such a separating hyperplane.

Here’s one way to do it. Each original data point is two-dimensional, characterized by features x1 and x2 (the values along each of the two axes, in this case) and a label y, which can be either 1 (circles) or -1 (triangles). We can project these data into three dimensions by creating a third feature, (x1² + x2²), which can be plotted on the z-axis. So, now each data point in three dimensions is represented by (x1, x2, x1² + x2²), for the values along the x, y, and z axes. When plotted in 3D, the triangles rise above the circles; a perceptron can find the depicted hyperplane to separate the two.
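The lift can be sketched directly. Here, the inner cluster, the outer ring, and the threshold plane z = 1 are arbitrary illustrative choices:

```python
import math

# Two classes that are linearly inseparable in 2D: points near the origin
# (radius 0.5) and points on a wider ring (radius 2). Adding the feature
# x1^2 + x2^2 as a z coordinate lifts them to separable heights in 3D.
inner = [(0.5 * math.cos(t), 0.5 * math.sin(t)) for t in range(6)]
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in range(6)]

def lift(p):
    x1, x2 = p
    return (x1, x2, x1 * x1 + x2 * x2)

# Inner points land at z = 0.25 and outer points at z = 4.0, so any plane
# at an intermediate height (say, z = 1) separates the two classes.
print(all(lift(p)[2] < 1 for p in inner),
      all(lift(p)[2] > 1 for p in outer))  # True True
```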

For the problem we’ve just tackled, it wasn’t very difficult to come up with a third feature to help separate the data into two clusters. But what if the 2D data look like this?

Now it’s not immediately obvious what to choose for the third feature; (x1² + x2²) won’t work. We need a more disciplined way to project data into higher dimensions. The method would have to work even if the lower-dimensional space were itself much higher than 2D (thus making it impossible for us to visualize). Also, once we project the data into higher dimensions, finding a linearly separating hyperplane in the augmented space involves taking the dot product of the higher-dimensional vectors, which can be computationally intractable. So, somehow, the algorithm needs to simultaneously do two things: (1) create new features such that the data can be mapped into some higher-dimensional space, and (2) avoid having to perform dot products in that new space and still be able to find the separating hyperplane.

Aizerman, Braverman, and Rozonoer showed just how to do that for the perceptron algorithm in their 1964 paper. In their reformulation of the algorithm, the weight vector characterizing the hyperplane is cast in terms of a linear combination of the vectors that make up the training dataset, and the decision rule to classify a data point depends only on the dot product of that data point with every other data point in the training dataset.

We’ll explore the ideas put forth by Aizerman, Braverman, and Rozonoer but avoid using the exact mappings the trio used; instead, we’ll use a mapping that was developed about a decade later, as it’s simpler to understand. Let’s start with data in two dimensions and map it into data in three dimensions, using three features. Given a vector xj in the low-dimensional space (2D in our case), it gets mapped to the vector φ(xj) in the high-dimensional space (3D in our case).

xj → φ(xj)

Our map goes like this:

x = [x1 x2] → φ(x) = [x1², x2², √2·x1x2]

So, if a point a in 2D is given by [a1 a2] and a point b is given by [b1 b2], then the same points, when projected into 3D space, become [a1², a2², √2·a1a2] and [b1², b2², √2·b1b2].

To find a linearly separating hyperplane, we’d have to take dot products of vectors in the higher-dimensional space. In this toy example, it’s no skin off our backs to perform dot products of all the vectors in the 3D space. But in the real world, the dimensionality of the augmented space can be humungous, making the computation far too resource-intensive (in terms of time and memory requirements). But Aizerman, Braverman, and Rozonoer showed us a cool trick that avoids this complication altogether.

To restate, to find the linearly separating hyperplane in the higher-dimensional space, we have to calculate dot products of φ(xj) with φ(xi), for all combinations of i and j.

What if we could do calculations with the two lower-dimensional vectors, xi and xj, that gave us the same answer as the dot product of the corresponding vectors in the higher-dimensional space? What if we could find a function K, such that:

K(xi, xj) → φ(xi).φ(xj)

In other words, if we were to pass the two low-dimensional vectors to the function K, then the function should output a value that equals the dot product of the augmented vectors in the higher-dimensional space. Let’s look at a concrete example of the vectors a and b:

a = [a1 a2]

b = [b1 b2]

φ(a).φ(b) = [a1², a2², √2·a1a2].[b1², b2², √2·b1b2] = (a1²b1² + a2²b2² + 2a1a2b1b2)

We need a function K, which produces the same output. Here’s one such function:

K(x, y) = (x.y)²

Feed this function the two lower-dimensional vectors, a and b, and see what happens:

K(a, b) = (a.b)²

= ([a1 a2].[b1 b2])²

= (a1b1 + a2b2)²

= (a1²b1² + a2²b2² + 2a1a2b1b2)

So, K(a, b) = φ(a).φ(b)

This is stupendous. Because we have been toying with 2D and 3D spaces, the import of this may not be obvious. Let’s, for a moment, imagine that a and b are 100-dimensional vectors and that φ(a) and φ(b) are million-dimensional. If we can find the appropriate mapping xj → φ(xj), such that K(xi, xj) → φ(xi).φ(xj), then we will be in a position to calculate the dot products of the high-dimensional vectors without ever stepping into that million-dimensional space; we can compute in one hundred dimensions.

The function K is called a kernel function. The method of using a kernel function to compute dot products in some higher-dimensional space, without ever morphing each lower-dimensional vector into its monstrously large counterpart, is called the kernel trick. It’s one neat trick.
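The trick is easy to check numerically for the quadratic kernel K(x, y) = (x.y)² and the 2D-to-3D map discussed above; the sample vectors here are arbitrary:

```python
import math

# Kernel trick check: K(a, b) = (a.b)^2 computed in 2D should equal the dot
# product of the mapped vectors φ([x1, x2]) = [x1^2, x2^2, sqrt(2)*x1*x2]
# computed in 3D.
def K(p, q):
    return (p[0] * q[0] + p[1] * q[1]) ** 2

def phi(p):
    x1, x2 = p
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

a, b = (3.0, -1.0), (0.5, 2.0)
print(abs(K(a, b) - dot(phi(a), phi(b))) < 1e-9)  # True: the two agree
```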

We can visualize the “kernelized” perceptron algorithm, using the mapping we’ve just analyzed. Let’s start with circles and triangles that are linearly inseparable in two dimensions (see above). Each data point is projected into three dimensions, and then the perceptron algorithm is used to find a linearly separating hyperplane. The figure opposite shows what happens in three dimensions.

You can see that the two types of data points are clearly separated in 3D, allowing the algorithm to find a plane (in this case, any plane) that separates the circles from the triangles. Now, given any new data point in 2D, we can project it into 3D space and classify it as a circle or a triangle, depending on its position relative to the hyperplane. When this delineated 3D space is projected back into the original 2D space, we get a nonlinear boundary that separates the circles from the triangles (see figure, below).
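In code, a kernelized perceptron keeps one coefficient per training point and touches the data only through kernel evaluations. A toy sketch (the dataset, the polynomial kernel, and the number of passes are illustrative choices, not the trio's exact 1964 formulation):

```python
# Kernelized perceptron sketch: one coefficient (alpha) per training point,
# updated on mistakes; the data enter only through kernel evaluations.
def K(p, q):
    return (1 + p[0] * q[0] + p[1] * q[1]) ** 2  # polynomial kernel

# An inner cluster of +1s and an outer ring of -1s: linearly inseparable in 2D.
data = [((0.3, 0.2), 1), ((-0.2, 0.3), 1), ((2.0, 0.1), -1), ((0.1, -2.0), -1)]
alphas = [0] * len(data)

for _ in range(200):  # passes over the data
    for j, (xj, yj) in enumerate(data):
        s = sum(a * yi * K(xi, xj) for a, (xi, yi) in zip(alphas, data))
        if yj * s <= 0:       # mistake: boost this point's coefficient
            alphas[j] += 1

def classify(u):
    s = sum(a * yi * K(xi, u) for a, (xi, yi) in zip(alphas, data))
    return 1 if s > 0 else -1

print([classify(x) for x, _ in data])  # [1, 1, -1, -1]: labels recovered
```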

Guyon had played around with kernels during her Ph.D. thesis, and she did so even after she started working. In particular, she had been using something called a polynomial kernel, introduced by MIT computational neuroscientist Tomaso Poggio in 1975. Here’s the general form of the polynomial kernel:

K (x, y) = (c + x.y)ᵈ, where c and d are constants

If you choose the constants to be 0 and 2 for c and d, respectively, you’ll get back the kernel we just used for the perceptron algorithm:

K (x, y) = (x.y)²

Let’s play around with one more variation, where the constants are 1 and 2, to get a better sense of why this works.

K (x, y) = (1 + x.y)²

For 2D data points:

a = [a1 a2]

b = [b1 b2]

We have:

K (a, b) = (1 + [a1 a2].[b1 b2])²
= (1 + a1b1 + a2b2)²
= 1 + (a1b1 + a2b2)² + 2(a1b1 + a2b2)
= 1 + a1²b1² + a2²b2² + 2a1a2b1b2 + 2a1b1 + 2a2b2

The question now is this: What should the mapping xj → φ(xj) be, such that:

K (xi, xj) → φ(xi).φ(xj)

With a little bit of fiddling around, we can discover the mapping. (Don’t worry if it’s not immediately obvious. This is not trivial; in fact, finding such mappings is quite the art.)

xj → φ(xj)

φ([x1 x2]) = [1, x1², x2², √2 x1x2, √2 x1, √2 x2]

(I have added commas to separate the elements of the larger vector, just for clarity; normally, one wouldn’t use commas.)

So:

φ(a) = [1, a1², a2², √2 a1a2, √2 a1, √2 a2]
φ(b) = [1, b1², b2², √2 b1b2, √2 b1, √2 b2]

We have the 2D coordinates, or vectors, a and b transformed into their 6D counterparts, φ(a) and φ(b). The question is: Does φ(a).φ(b), evaluated in the 6D space, give the same result as the kernel function operating on a and b in 2D space? Let’s check:

φ(a).φ(b) = (1)(1) + (a1²)(b1²) + (a2²)(b2²) + (√2 a1a2)(√2 b1b2) + (√2 a1)(√2 b1) + (√2 a2)(√2 b2)

= 1 + a1²b1² + a2²b2² + 2a1a2b1b2 + 2a1b1 + 2a2b2

= K (a,b)

Voilà! They are the same. So, the kernel function lets us calculate the dot product of 6D vectors, but we never have to formulate those vectors, or compute, in 6D space. For our polynomial kernel, we used the constant values of 1 and 2. You can verify that the kernel function works for any value of the constants, and so we can project the data into ever-higher dimensions, where we are more and more likely to find a linearly separating hyperplane.
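A quick numerical check makes this concrete. The sketch below uses the 6D map implied by the expansion above (the √2 factors are one conventional way to absorb the cross-term coefficients) and confirms the identity on random vectors:

```python
import math
import random

# The 6D map implied by expanding (1 + a.b)^2; the sqrt(2) factors
# absorb the factor-of-2 cross terms in the expansion.
def phi(v):
    x1, x2 = v
    r2 = math.sqrt(2)
    return [1.0, x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(1)
ok = True
for _ in range(1000):
    a = [random.uniform(-5, 5), random.uniform(-5, 5)]
    b = [random.uniform(-5, 5), random.uniform(-5, 5)]
    k = (1.0 + dot(a, b)) ** 2        # kernel, evaluated in 2D
    f = dot(phi(a), phi(b))           # dot product, evaluated in 6D
    ok = ok and math.isclose(k, f, rel_tol=1e-9, abs_tol=1e-6)

print(ok)  # True: the kernel and the 6D dot product always agree
```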

(An aside: The size of the higher-dimensional space is given by the binomial coefficient (n + d choose d); n is the size of the original, low-dimensional space, and d is the value of the constant used in the polynomial kernel. Another aside: Why is it so important to work with linear classifiers or to do linear regression?)
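To see how quickly that size grows, the binomial coefficient can be evaluated directly; `feature_dim` is a hypothetical helper written for this sketch:

```python
import math

# Number of features for a degree-d polynomial kernel on n-dimensional
# inputs: all monomials of degree at most d, i.e. C(n + d, d).
def feature_dim(n, d):
    return math.comb(n + d, d)

print(feature_dim(2, 2))    # 6, the six-dimensional space from the text
print(feature_dim(100, 5))  # 96560646 -- nearly a hundred million features
```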

Guyon knew of these kernels, but she hadn’t connected the dots between optimal margin classifiers that worked with linearly separable data and linear classifiers, like the perceptron algorithm, that could use the kernel trick to do their magic in higher dimensions. That would have to wait until 1991, when her husband told her about Vapnik’s idea of pushing data into higher dimensions by creating new features and building an optimal margin classifier in the high-D space. These features could be generated by multiplying individual features, for example:

xj → φ(xj)

From 2D to 3D: [x1 x2] → [x1 x2 x1x2]

Or, from 3D to 7D:

[x1 x2 x3] → [x1 x2 x3 x1x2 x1x3 x2x3 x1x2x3]

Vapnik wanted Bernhard Boser to implement the algorithm by creating new features in this manner and then explicitly do the dot products in the higher-dimensional space. When Boser mentioned this to Guyon that morning on their way to work, she immediately saw the futility of such an algorithm. “I told Bernhard, there’s no point in making these [dot] products of features,” she said. “Let’s use the kernel trick.” She explained the trick to her husband and then quickly rewrote Vapnik’s optimal margin algorithm to use a kernel. “I rewrote it in a way that made it apparent how to kernelize it. Everywhere there was a dot product [in higher dimensions], I replaced it by a kernel. Once you see that, it’s trivial,” she said.

We saw earlier that the hyperplane found using a kernel and then projected back into two dimensions might look like this:

Even a cursory visual inspection suggests that the boundary is too close to the circles, and it’d be easy to misclassify a new point as a triangle when in fact it is a circle. The solution would be to use an optimal margin classifier in the higher dimensions, rather than a perceptron. The algorithm would find support vectors in the higher dimensions and the appropriate hyperplane running through the middle of no-one’s-land. Projected back to 2D, you’d get this new boundary:

Now the chances of correctly classifying new data points are higher compared with those for the perceptron’s sub-optimal decision boundary.

Initially, Guyon and Vapnik argued over whether the kernel trick was important. Boser, meanwhile, was more practical. “This was a very simple change to the code. So, I just implemented it, while [they] were still arguing,” Boser told me.

Guyon then showed Vapnik a footnote from the Duda and Hart book on pattern classification. It referred to the kernel trick and to the work of mathematicians Richard Courant and David Hilbert, giants in their fields. According to Guyon, this convinced Vapnik. “He said, ‘Oh, wow, this is something big,’ ” Guyon said.

The reference to Hilbert is particularly enticing. So-called Hilbert spaces allow for infinite-dimensional vectors. There’s a kernel, called the radial basis function (RBF) kernel, that equates K (a, b) with the dot product of two corresponding high-dimensional vectors φ(a) and φ(b), even though it’s impossible to do the mapping from the low-dimensional vector to the high-dimensional space (a → φ(a), for example). That’s because the higher-dimensional space is infinite. And yet, one can calculate K (a, b). The import of this is rather stunning: Given some simple assumptions about the dataset, it’s possible to turn the problem, regardless of the complexity of the decision boundary in lower dimensions, into a linearly separable problem in infinite dimensions. For the curious, here’s the kernel function:

K (a, b) = exp(−‖a − b‖² / 2σ²), where σ is a parameter of the kernel

The RBF kernel is the “Brad Pitt of kernels,” Weinberger joked in his lectures on machine learning. It’s so perfect, “people sometimes faint when they see it.”
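A sketch makes the point: the RBF kernel, written here in its standard form with a width parameter σ (an assumption about the book's exact notation), is a one-liner to evaluate, even though its feature space is infinite-dimensional:

```python
import math

# Radial basis function (RBF) kernel, in its standard form; sigma (the
# kernel width) is a free parameter, set arbitrarily for this demo.
def rbf(a, b, sigma=1.0):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

# The corresponding feature map phi is infinite-dimensional and can never
# be written out, yet K(a, b) = phi(a).phi(b) is trivial to evaluate:
print(rbf([1.0, 2.0], [1.0, 2.0]))    # identical points: K = 1.0
print(rbf([0.0, 0.0], [1.0, 0.0]))    # nearby points: K = exp(-0.5)
print(rbf([0.0, 0.0], [10.0, 0.0]))   # distant points: K is nearly 0
```

Notice that K depends only on the distance between the two points, which is why it measures similarity: identical points score 1, and the score decays toward 0 as points move apart.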

Understanding Weinberger’s paean to the RBF kernel demands a segue into decision boundaries. A decision boundary through some coordinate space (such as 2D or 3D) can be thought of as a function. For example, the nonlinear decision boundary we found earlier, using the kernelized perceptron algorithm, is equivalent to finding a function that takes in two values, x1 and x2, and helps us draw a curve that resembles the decision boundary. These boundaries can be extremely complex, especially when the input space itself is considerably higher dimensional than the two or three we have been visualizing and when the two classes of data are way more mixed than we have seen thus far. An RBF kernel, because it can help an algorithm always find a linearly separable hyperplane in some infinite-dimensional space, can find any decision boundary, or function, when mapped to the lower-dimensional space, no matter how complex. It’s said to be a “universal function approximator.” Keep this phrase in mind, for we’ll devote an entire chapter to it when we discuss how certain types of artificial neural networks are also universal function approximators; given enough neurons, they can solve any problem.

The combination of Vapnik’s 1964 optimal margin classifier and the kernel trick proved incredibly powerful. No dataset was off-limits now. It didn’t matter how intermingled the different classes of data were in the original low-dimensional space. One could project the data into extremely high dimensions and then use the optimal margin classifier to find the best linearly separating hyperplane, but do so using the kernel function, and thus avoid ever having to compute in the high-dimensional space.

Boser did most of the work implementing and testing the algorithm, particularly on the favorite dataset of the time: the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits. Meanwhile, Guyon took it upon herself to write a paper to be submitted to the annual Computational Learning Theory (COLT) conference. Her colleagues and friends at Bell Labs, particularly Esther Levin and Sara Solla, had written papers for this prestigious conference. “It was kind of thought [that] if you have a paper in COLT, you are a real machine learning person, because it’s a hard-core theory conference,” Guyon said.

Guyon and Boser went to meet the COLT conference organizers, Manfred Warmuth and David Haussler. “We asked them whether they thought it was a paper that would fit into that conference. We gave a talk, and then they told us, ‘Oh, yeah, we like application papers.’ For us, it was the most theoretical work that we ever did, and for them, it was an application paper,” Guyon told me, laughing at the memory.

Warmuth recalled being approached by Guyon and Boser. He and Haussler were fascinated by the simplicity of the technique and “of course by the KERNEL TRICK!!!” Warmuth wrote me in an email. They accepted the paper, titled “A Training Algorithm for Optimal Margin Classifiers,” and it appeared in July 1992 in the Proceedings of the Fifth Annual Workshop on Computational Learning Theory.

It took a decade, but the paper eventually became a classic. Meanwhile, others at Bell Labs furthered the work. Kristin Bennett’s Ph.D. work on mathematical programming, which refers to the use of mathematical models to solve problems, in this case ML problems, inspired Vapnik and Corinna Cortes, a Danish data scientist who was then at Bell Labs and is now a VP at Google Research, to develop what they called a “soft-margin” classifier. This approach, published in 1995, made allowances for problematic data points that would otherwise have made it impossible to find a linearly separating hyperplane, even in the higher-dimensional space.

Vapnik and Cortes called their algorithm the support vector network. Bernhard Schölkopf, a German computer scientist, who is now the director of the Max Planck Institute for Intelligent Systems, in Tübingen, Germany, renamed it by coining the term “support vector machine,” helping distinguish the algorithm from neural networks. Thus did “support vector machine” (SVM) enter the lexicon of machine learning.

With the math behind us, it’s rather simple to recap what SVMs do: They take datasets that are linearly inseparable in their original, relatively low-dimensional space and project these data into high enough dimensions to find an optimal linearly separating hyperplane, but the calculations for finding the hyperplane rely on kernel functions that keep the algorithm firmly anchored in the more computationally tractable lower-dimensional space. Support vectors, we saw, refer to those data points that lie on the margins of no-one’s-land. The technique doesn’t find any old hyperplane in the higher dimensions; it finds an optimal one. When projected back to lower dimensions, this hyperplane can resemble a very convoluted but nonetheless optimal decision boundary.

SVMs took off, becoming the darlings of the ML community through much of the 1990s and 2000s. While Guyon had been instrumental in redesigning Vapnik’s optimum margin classifier to incorporate the kernel trick, she credits Vapnik for recognizing the power of kernelized SVMs and for making sure the wider community understood it. “I didn’t really recognize that this was an important invention,” she said. “For me it was just a trick, a computational trick, and it was giving very nice results.”

Given Vapnik’s evangelism, it’s only relatively recently that the broader ML community acknowledged Guyon and Boser, whose contributions had been somewhat overshadowed, as it were, by Vapnik’s stature. Even without SVMs, Vapnik had contributed heavily to the theory of machine learning. “Manfred [Warmuth] and I gave Vapnik a kind of superstar status from the beginning,” David Haussler wrote me in an email. Haussler and Warmuth had coined the term “Vapnik-Chervonenkis (VC) dimension,” which uses math developed by Vapnik and fellow mathematician Alexey Chervonenkis to provide a measure of an ML model’s capacity to classify data correctly. “After that, with the help of AT&T Bell Labs, where he was treated as the ultimate guru, there emerged a narrative that Vapnik was the true genius behind the emerging revolution in machine learning. I know he personally felt a bit overwhelmed by it,” Haussler wrote in his email.

As far as SVMs go, this storyline is changing. In 2020, the BBVA Foundation gave its Frontiers of Knowledge Award to Isabelle Guyon, Bernhard Schölkopf, and Vladimir Vapnik; to Guyon and Vapnik for inventing support vector machines and to Schölkopf for furthering their power with the use of kernel methods. “The human tendency [is] to adopt a simple narrative with a single hero when in fact the reality is quite a bit more complex,” Haussler wrote to me.

The BBVA Foundation’s citation goes on to say, “Thanks to SVM and kernel methods, intelligent machines can now be trained to classify datasets with human precision, or at times even better, enabling them to recognize everything from voices, handwriting or faces to cancer cells or the fraudulent use of credit cards. SVMs are now being used in genomics, cancer research, neurology, diagnostic imaging, and even in HIV drug cocktail optimization, as well as finding diverse applications in climate research, geophysics and astrophysics.”

As we’ll see in upcoming chapters, with the thawing of the first AI winter, research in artificial neural networks began burgeoning in the 1980s. But the abrupt arrival of SVMs and kernel methods in 1992 derailed the advance of neural networks for a while. “It’s like Vapnik took a big beam of light and showed it to everybody: ‘Look, there is this opportunity of applying this kernel trick,’ ” Guyon said. And then Schölkopf and his colleague Alex Smola wrote a comprehensive book on kernel methods, illustrating much of what one could do with the kernel trick. “And that was it,” Guyon said. “It was incredible. Neural networks dominated machine learning in the eighties. And in the nineties, all of a sudden, everybody switched to kernel methods.”

Now neural networks are dominating modern machine learning again. Intriguingly, theoretical advances are beginning to show tantalizing links between neural networks and kernel machines. To appreciate this connection, we need to step back a decade or so, to the early 1980s, when the ML community started paying serious attention to neural networks—thanks to the work of John Hopfield, whose eponymous networks inspired Guyon’s foray into machine learning. Hopfield, a physicist, was at a crossroads in his career and was looking to work on something big. He could not have imagined the changes he would inspire.

CHAPTER 8

With a Little Help from Physics

In the late 1970s, Princeton University physicist John Hopfield faced a familiar career question: “Now what?” He was looking for a new research direction. The dilemma wasn’t new for him. Hopfield had spent the 1960s making seminal contributions to solid-state and condensed matter physics. But by the end of that decade, he had run out of steam. He couldn’t find interesting problems to which to apply his “particular talents.” So, he turned to biology, focusing on cellular biochemical reactions, such as those involved in the synthesis of proteins. Hopfield looked at transfer RNA, or tRNA, molecules that “recognize” the correct amino acids and bring them to the site of protein synthesis in cells. Every aspect of life depends upon this process proceeding without errors. But given how error-prone biological processes can be, how does biology get it right? Biochemists, at the time, tended to think of a process as simply going from step A to B to C. Hopfield realized that there were multiple pathways from A to C and that the ability to take one of many pathways was necessary for reducing errors, allowing the system to proofread, as it were. “You can’t make things error-free enough to work if you don’t proofread, because the [biological] hardware isn’t nearly perfect enough,” Hopfield told me.

Hopfield published his “biology” paper in 1974. “This was the first paper I had ever written containing words like ‘nucleoside’ or ‘synthetase’ or ‘isoleucine’ or even ‘GTP,’ ” he would later write. In 1976, Hopfield gave a talk at Harvard about his ideas on proofreading and predicted what biochemists should see for certain reactions in terms of the quantities and ratios of molecules. As far as he was concerned, these were theoretical predictions and yet to be corroborated by experiments. After the talk, a scientist in the audience told Hopfield that he had seen such “stoichiometry” ratios in his studies of bacteria. Researchers had found that the antibiotic streptomycin interferes with the bacteria’s ability to proofread; as a result, the bacteria synthesize numerous erroneous and functionally lethal proteins, causing them to die. Hopfield was elated at the empirical validation of his theoretical work. “It was one of the biggest—and most delightful—surprises of my scientific career,” he wrote.

That momentous occasion aside, this foray into biology set the stage for another of Hopfield’s fundamental contributions, this time in computational neuroscience—or, to take a broader view of it, in machine learning and AI. The 1974 paper was a clear elucidation of the idea that networks of reactions (in this case, consisting of multiple molecular pathways from the same starting point to the same end point) had functions that went beyond what could be understood by looking at individual molecules. “A network could ‘solve a problem’ or have a function that was beyond the capability of a single molecule and a linear pathway,” Hopfield wrote. “Six years later I was generalizing this view in thinking about networks of neurons rather than the properties of a single neuron.”

But before he could embark on this work, Hopfield had to identify a “PROBLEM” to work on (the all caps being his way of emphasizing that the problem had to be substantial). This took a while. A chance invitation to attend semi-annual meetings of the Neuroscience Research Program at MIT, about five hours northeast from Princeton, gave Hopfield the necessary stimulus. “How mind emerges from brain is to me the deepest question posed by our humanity. Definitely A PROBLEM,” he wrote. It became clear to him, while he was attending those meetings, that the work the neuroscientists were doing—“primate neuroanatomy, insect flight behavior, electrophysiology in Aplysia, learning in rat hippocampus, Alzheimer’s disease, potassium channels, human language processing”—was exploring their own special corners of the field. Hopfield was after something more integrative, something that required the tools of his trade: theoretical physics. Specifically, he was searching for a basic and potentially far-reaching insight into how the brain computes.

A machine computes by changing configurations from one “state” to another (according to some prescribed rules specified, say, by a programmer), until eventually the machine reaches an end state. The end state represents a solution that can be read off. A computer, then, is a dynamical system, one whose behavior can be seen as evolving, or transitioning from state to state, with each tick of the clock, with rules that specify the state transitions and the set of allowable states, the so-called state space. “I’d seen enough [at the NRP meetings] to understand that if you could describe the equations of how neural activity propagated from one neuron to another, you had a dynamical system,” Hopfield told me. “All computers are dynamical systems. There had to be this link between neurobiology and digital or analog computers.”

Also, Hopfield’s work on proofreading in biochemical processes was evidence that dynamical systems that could take multiple pathways through the state space to “converge” to the same final state could reduce the errors that accumulate during computation. Hopfield kept looking for a neurobiological problem that was amenable to such a solution. He finally hit upon one: associative memory. The term may be cryptic to most of us, but it’s something with which we are intuitively familiar. Think about how the strains of a song or the hint of an aroma can bring to mind an entire episode from our lives. Our brains are somehow able to use a fragment of the original experience to bring into conscious awareness an entire stored memory. Well, that’s associative memory. Hopfield was after a computational model of the same. Could a network of artificial neurons that had stored within them some memories be capable of retrieving a particular memory given only some sliver of it? Could he solve the problem in terms of the dynamics of the network converging to a solution? “It took me quite a while to find any problem in neuroscience which looked like that computation, and I finally realized associative memory did,” Hopfield said.

Understanding the essence of such a computation requires delving into physics, this time the physics of ferromagnetism and a simplified mathematical model of it. The parallels to computing with neurons are striking.

FLIP-FLOP

Some simple math connects phenomena as seemingly diverse as the process that gives us window glass, the magnetization of materials, and the workings of some types of neural networks, at least the artificial kind.

Let’s start with window glass. One method for making such glass is to start with the raw materials—usually silica (sand), soda ash, and limestone, with silica being the primary component. The mixture is melted to form molten glass and then poured into a “float bath.” The bath gives plate glass its flatness and helps cool the molten material from temperatures of over 1,000°C down to about 600°C. This flat material is further “annealed,” a process that releases any accumulated stresses in the glass. The key, for our purposes, is that the resulting glass is neither a solid with an ordered crystalline structure nor a liquid. Instead, it’s an amorphous solid where the material’s atoms and molecules don’t conform to the regularity of a crystal lattice.

There’s an interesting analog in magnetism. Certain materials, for example, are ferromagnetic, a state in which the magnetic moments of the material’s atoms (or ions) are all aligned, generating a net magnetism. A ferromagnet is analogous to a solid with a definite crystalline structure. However, if the magnetic moments of the atoms, or ions, are randomly oriented, the material has no permanent magnetism—analogous to the structure of glass. Each individual magnetic moment is the outcome of the spin of an elementary particle in the material. Hence, materials with disordered magnetic moments are called spin glasses.

In the early 1920s, the German physicist Wilhelm Lenz and his graduate student Ernst Ising developed a simple model of such materials. It came to be called the Ising model. For his doctoral thesis, Ising analyzed a one-dimensional case of magnetic moments. The engendering spins can be either up (+1) or down (-1). In the model, any given spin state is influenced only by its immediate neighbors. For example, if one spin state is -1, but both its neighbors are +1, then the spin will flip directions. It’s clear that such a system will have some dynamics, because as each spin state reacts to its nearest neighbors, the effects of spin flips will ripple back and forth through the system. If all the spins taken together constitute the system’s state, then the system traverses a state space, going from one state to another, possibly settling into some stable state or continually oscillating. Ising showed that a 1D system cannot be ferromagnetic (meaning, the spins will never all align in one direction). He even argued—erroneously, it turned out—that state transitions from disorderly to orderly would not happen even in the three-dimensional case.
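The dynamics described here are easy to simulate. The sketch below makes some simplifying assumptions not in the text (a zero-temperature update rule and fixed spins at the two ends of the chain): each interior spin flips when both of its neighbors oppose it, and the system settles into a stable state of aligned domains.

```python
import random

# Zero-temperature 1D Ising dynamics (a simplifying assumption): an
# interior spin flips when both of its immediate neighbors oppose it.
# The two endpoint spins are held fixed.
random.seed(3)
spins = [random.choice([-1, 1]) for _ in range(20)]

def step(s):
    out = s[:]                        # synchronous update: read s, write out
    for i in range(1, len(s) - 1):
        if s[i - 1] == s[i + 1] != s[i]:
            out[i] = s[i + 1]         # flip to agree with both neighbors
    return out

state = spins
for _ in range(100):                  # iterate until nothing changes
    nxt = step(state)
    if nxt == state:
        break
    state = nxt

# What remains are stable domains of aligned spins, not necessarily full
# alignment, echoing Ising's result that a 1D chain is not ferromagnetic.
print(state)
```

The final configuration is a fixed point of the dynamics: a state the system has settled into, in the sense the chapter uses when it later treats computation itself as a dynamical system converging to an end state.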

In 1936, Rudolf Ernst Peierls, a German physicist who left Germany during the Nazi era and became a British citizen, rigorously studied the model for the 2D case. (It was Peierls who attributed the model to Ising, giving it its name.) “For sufficiently low temperatures the Ising model in two dimensions shows ferromagnetism and the same holds a fortiori also for the three-dimensional model,” Peierls wrote. (I had to look up “a fortiori.” As per the Oxford English Dictionary, it means “used to express a conclusion for which there is stronger evidence than for one previously accepted.”)

Here’s how a 2D model of spins, or magnetic moments, might look:

The black “UP” arrows are assigned a value of +1, and the gray “DOWN” arrows are assigned -1. What’s shown could be the state of a 2D system at some instant in time. Except for the arrows along the edges of the square, each arrow has four nearest neighbors (left and right and up and down; the diagonals are ignored):

Given a certain state of the system, each spin is going to be influenced by two things: any external magnetic field and the magnetic fields induced by its nearest neighbors. The latter, in turn, depend on the strength of the interaction between two spins—for example, the closer they are in the material’s lattice, the stronger the interaction—and on whether the material is ferromagnetic or anti-ferromagnetic. (In ferromagnetic material, the spins align with their nearest neighbors; in anti-ferromagnetic material, the spins prefer to be in opposition.)

Let’s stick to ferromagnetic materials. From the perspective of a physicist, the question that comes to mind is this: Why should a ferromagnetic material that has disordered spins in its lattice end up in a state where all the spins are aligned in one direction, creating macroscopic magnetism? To answer this question, we have to look to something called the Hamiltonian, an equation that allows one to calculate the total energy of a system. (Yes, the equation is named after the mathematical graffiti artist we met in chapter 2, William Rowan Hamilton.)

Let’s say that σi gives us the value for the spin of the ith element in a 2D system: It’s either +1 or -1. There are many ways to write down the Hamiltonian of a system (depending on the assumptions one makes). Here’s one:

H = -J ∑i∑j σi.σj - h ∑i σi

The first term in the equation involves taking each pair of nearest neighbors in the lattice, multiplying their spins, multiplying the result by some constant J that denotes the strength of the interaction between two adjacent spins, and then adding up the result for all such pairs. The double-sigma notation means sum over all adjacent pairs of spins, except for when i=j. (As an aside, if the notation simply says ∑i∑j, with no restriction, then we sum over all adjacent pairs, even when i=j.)
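To make the Hamiltonian concrete, here is a minimal sketch in Python. The function name and the choice of a one-dimensional ring of spins (so each spin’s neighbor is simply the next one over) are our own illustrative assumptions, not from the text:

```python
import numpy as np

def ising_energy(spins, J=1.0, h=0.0):
    # First term: -J times the sum of products of adjacent spins.
    # np.roll pairs each spin with its neighbor on a 1D ring.
    neighbor_term = np.sum(spins * np.roll(spins, 1))
    # Second term: -h times the sum of all spins (external field).
    field_term = np.sum(spins)
    return -J * neighbor_term - h * field_term

aligned = np.array([1, 1, 1, 1, 1, 1])         # all spins up
alternating = np.array([1, -1, 1, -1, 1, -1])  # disordered extreme
```

With J = 1 and no external field, the aligned configuration has energy -6 and the alternating one +6, matching the intuition that, for a ferromagnet (J > 0), aligned spins minimize the energy.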

The second term of the Hamiltonian takes each spin and multiplies it by some external magnetic field, h, and sums the result for all spins. If there’s no external field, then this term is zero.

Note that each term is preceded by a minus sign. This has the following physical significance: If two adjacent spins have the same direction (+1, +1) or (-1, -1), their product will be positive. So, prefixing a minus sign makes the whole term negative. This causes the Hamiltonian, and hence the system’s energy, to decrease in value. However, if two adjacent spins have opposing values, their product will be negative, and taking the negative of that term results in a positive value, raising the energy. So, if two spins are aligned, they lower the energy of the system; if they are in opposition, they raise it. Even this cursory analysis tells us that if all the spins were to be aligned, the energy of the system would reach a minimum.

This should ring a bell. (Think back to gradient descent and reaching the bottom of a bowl.) Physical systems prefer configurations of lower, rather than higher, energy. In the Hamiltonian described above, if the constant J > 0, the material is a ferromagnet; if J < 0, it’s anti-ferromagnetic, and if Jij is random, thus signifying a different J for each pair of spins, then the material is a spin glass.

Given his work in condensed matter and solid-state physics, Hopfield knew something about spin glasses. Also, he had identified the neurobiological problem he wanted to address: How does a neural network recover a stored memory based on partial information? The Ising model was almost tailor-made to describe the simple neural network he had in mind. By making one more important assumption about how the artificial neurons were connected to each other (and we’ll come to the details), Hopfield could design a network whose dynamics ensured that storing or retrieving a memory was akin to putting the ensemble of neurons, and hence the network, into some stable low-energy state. This state was characterized by the strengths of the connections, or the weights, between the neurons. If you were to read off the outputs of the neurons in this stable state, they would be representative of some memory. Then, if you were to perturb the system by changing some inputs to the neurons, and hence their outputs, this would constitute a partial disruption of memory. If you read off the outputs of neurons now, they would represent a distorted memory. But this perturbation would put the system into some high-energy state, and the network would dynamically find its way back to the stable state. Because that low-energy, stable state represents the memory, that memory can be retrieved. The dynamical process restores the memory.

The math of how this happens is our first foray into modern neural networks. To get there, we have to harken back to the first artificial neuron, designed in the 1940s, which we encountered in chapter 1, the McCulloch-Pitts (MCP) neuron.

NEURAL NETWORKS: THE REVIVAL BEGINS

John Hopfield was among the few researchers who did not give up on neural networks, despite the blow dealt to the field by Marvin Minsky and Seymour Papert in their 1969 book, Perceptrons. (We’ll meet other researchers in subsequent chapters, in particular Geoff Hinton and Yann LeCun, who also kept the faith.) Recall that Frank Rosenblatt and others had shown, using the perceptron convergence theorem, that the perceptron will always find a linearly separating hyperplane if the dataset can be cleanly divided into two categories. Teaching the perceptron using training data involves finding the correct set of weights for the perceptron’s inputs. However, this algorithm works only for a single-layer perceptron (meaning, you have to provide inputs to a perceptron and read off its output; you cannot feed the output of one perceptron as input to another). Minsky and Papert proved mathematically—and elegantly so—that single-layer perceptrons are ineffective when the data are not linearly separable in some given set of dimensions. They then conjectured that while multi-layer perceptrons, where the output of one layer becomes an input to the next, could solve such problems, there was likely no way to train such networks.

“Minsky didn’t see how to do learning in multi-layer networks and surmised that perhaps multi-layer networks couldn’t actually do anything that single-layer networks couldn’t. Once the convergence theorem on single-layer networks was done and one understood what [such] networks could or couldn’t do, that was sort of the end of the subject,” Hopfield told me. “Minsky had missed the point.”

Many others hadn’t. In the 1970s, researchers were beginning to probe how to train multi-layer perceptrons (or multi-layer neural networks). The outline of an algorithm that would soon be called backpropagation, or backprop, was taking shape. But the computing power in those days wasn’t up to the task. “Nobody could do backprop on any interesting problem in [the 1970s]. You couldn’t possibly develop backprop empirically,” Hopfield said.

This was the state of affairs when Hopfield entered the field, as he tried to answer his own question: “What next?” He started with an artificial neuron that was part Rosenblatt’s perceptron and part the McCulloch-Pitts neuron.

Consider a neuron that has two inputs, x1 and x2. In Hopfield’s version, the inputs are restricted to bipolar values of 1 or -1 (an unfortunate nomenclature, given the import of the word “bipolar” in psychiatry and psychology). Each input is multiplied by its corresponding weight, x1 by w1, and x2 by w2. The weighted inputs are then summed, to give us: w1x1 + w2x2.

If the weighted sum is greater than 0, then the neuron outputs a 1 (so, y = 1). Otherwise, it outputs a -1 (y = -1).

Normally, there’s an extra bias term; we saw this in chapter 1. So, the output of the neuron will be +1 if w1x1 + w2x2 + b > 0; and -1 otherwise. But for the analysis that follows in this chapter, we will ignore the bias term without losing any generality.

Here’s the formal equation for the neuron’s output:

y = 1 if w1x1 + w2x2 > 0; otherwise, y = -1
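In code, this rule is just a thresholded weighted sum. A minimal sketch (the function name is our own):

```python
def neuron_output(weights, inputs):
    # Weighted sum of bipolar (+1/-1) inputs; the output is +1 if
    # the sum is greater than 0, and -1 otherwise (no bias term).
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > 0 else -1
```

For instance, with weights (0.5, 0.5), inputs (1, 1) yield +1 and inputs (-1, -1) yield -1; note that a weighted sum of exactly zero falls to the -1 branch.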

That’s it; that’s our neuron. Hopfield’s next intuition was to create networks of neurons bi-directionally connected to each other. In other words, if the output of neuron A goes as input to neuron B, then the output of neuron B forms an input to neuron A. Let’s analyze a simple two-neuron network (see figure, below).

The output of neuron 1, y1, becomes the input to neuron 2. And the output of neuron 2, y2, becomes the input to neuron 1. And each input is multiplied by its corresponding weight: w12y2 for the output of neuron 2 serving as input to neuron 1; and w21y1 for the output of neuron 1 serving as input to neuron 2.

The weights have a numbering that tells us the direction of the input: w21 means the signal is going from neuron 1 to neuron 2; w12 signifies the signal is going from neuron 2 to neuron 1. Let’s take a moment to grasp what’s happening here. The neurons have no other input except what’s coming from other neurons in the network. Also, note that a neuron doesn’t talk to itself, meaning the output of neuron 1 doesn’t become an input to itself.

This means that the outputs of neurons 1 and 2 are:

y1 = 1 if w12y2 > 0; otherwise, y1 = -1
y2 = 1 if w21y1 > 0; otherwise, y2 = -1

Two neurons don’t really make much of a network, and also, they’re not enough to get a sense of the succinct mathematical formalism that can be used to describe such networks. Consider a network with three neurons (see figure, below).

Now we can write these equations for each neuron:

w12y2 + w13y3: the weighted sum of the inputs to neuron 1

w21y1 + w23y3: the weighted sum of the inputs to neuron 2

w31y1 + w32y2: the weighted sum of the inputs to neuron 3

This weighted sum of the inputs to each neuron, i, can be compactly described as follows. (Note the “.” between the terms wij and yj is there for clarity’s sake; it’s not the dot product; rather, it’s just two scalars being multiplied.)

∑j wij.yj

This is a summation over all j, except for when i=j (to prevent each neuron from influencing itself). For each neuron, if its weighted sum is greater than 0, the output of the neuron is 1; or else it’s -1. Succinctly, the output of the ith neuron is:

yi = 1 if ∑j wij.yj > 0 (summing over all j ≠ i); otherwise, yi = -1

This compact formula describes a network with any number of neurons, in which the neurons are connected to one another. Can you see the similarities with the Ising model of magnetism? Let’s say that you have 100 neurons, and each neuron’s output is randomly set to either +1 or -1. What happens next? Think about a spin glass, in which the magnetic moments in a material are randomly ordered. Each magnetic moment is going to react to its nearest neighbors and flip or not flip. In our network, something similar happens. Each neuron is listening to every other neuron. Take neuron 1. It’s getting inputs from 99 other neurons. Then, neuron 1 will calculate the weighted sum of the inputs from 99 neurons and will set its output to +1 if the weighted sum is greater than zero; otherwise to -1. If the new output is the negative of the neuron’s previous output (-1 versus 1, or vice versa), then that counts as a flip.

We haven’t yet talked about how the weights (w12, w13, w21, w23, w31, w32, in the case of 3 neurons) are set or calculated. We’ll come to that. For now, let’s assume that the weights are not symmetric, meaning w12 is not necessarily equal to w21, and so on. This is the sort of network, with asymmetric weights, that John Hopfield began studying at first.

He devised a method of calculating the energy of such networks. He defined it as:

E = -1/2 ∑i∑j wij.yi.yj (summing over all i, j with i ≠ j)

Of course, these networks are simulations inside a computer, so they don’t really have a physical energy. But one can use this formula to calculate a number that’s analogous to physical energy. In the next section, “Take Me Home,” we’ll analyze why this behaves like energy, but for now, let’s take Hopfield’s word for it.

In the 3D Ising model of materials, it can be shown that ferromagnetic materials will dynamically settle into the lowest energy state, and this state is one in which all the magnetic moments are aligned. Hopfield was after similar dynamics in a network of bi-directionally connected neurons. But given the energy function I’ve just described, a network with asymmetric weights will not settle into the lowest energy configuration; the network is said to be unstable.

Then Hopfield had an insight. What if the weights were symmetric? So, w12 is equal to w21, and w13 is equal to w31, and so on.

“As soon as I got to symmetric connections, oh, I knew it’d work,” Hopfield told me. “Stable points were guaranteed.”

What’s all this got to do with associative memory? Recall that that’s how this story began—with Hopfield looking for a neurobiological problem to solve with neural networks. Well, let’s give the game away a bit in advance, before coming to how it actually works. Imagine a way of setting the weights of the network such that a given pattern of outputs of the neurons represents a stable state, an energy minimum. (The outputs are equivalent to the spin states of the spin glass.) If the network is in this state, it is not going to change any further. This pattern of outputs can be thought of as the memory you want to store in the network. Now you are given a pattern that’s a corrupted version of that memory. A few bits are +1 instead of -1, and vice versa (where each bit is the output of one neuron). Let’s leave the weights of the neurons untouched, but let’s force their outputs to represent this corrupted pattern. What Hopfield found is that when you do this, the network is no longer stable, and its dynamics take over: Each neuron flips (or not) until the network reaches a stable state. And if the corrupted state you forced the network into wasn’t very different from the stored memory, the network will reach the stable state that represents the memory. The outputs of the neurons, once the network reaches this stable state, will not flip anymore. You can simply read off the outputs: You have recalled the memory.

“Once I understood that I needed symmetric connections—and the Ising model for magnetism had a lot to offer—I put things together, and then it was just clear,” Hopfield said. “It was just sitting down and grinding things out.”

TAKE ME HOME

There are many moving parts to understanding a Hopfield network, and they involve coming to grips with different conceptual ideas. One, what does it mean to store a memory? (We got a glimpse of the answer in the previous section.) Two, what does it mean for a network to be stable? Three, how does one select the weights of the network in order to store a memory? Four, what do storing memories and stable states have to do with each other? Five, what does energy have to do with all this?

Let’s start with a simple 3-neuron network. The weights of the network can be written in matrix form:

W =
| w11  w12  w13 |
| w21  w22  w23 |
| w31  w32  w33 |

Generically, wij represents the weight of the connection going from neuron j to neuron i. There are some obvious aspects to this matrix. First, because a neuron doesn’t talk to itself, the diagonal elements of the matrix will be zero. Also, the matrix, as per Hopfield’s requirement, is symmetric about its diagonal: wij = wji.

The network has 3 neurons and, hence, 3 outputs, so we can store any pattern that’s 3 bits long. Let’s say we want to store the pattern “-1, 1, -1.” This means that when the output of neuron 1 is “-1,” neuron 2 is “1,” and neuron 3 is “-1,” the network should be in a stable state. We have to select the weights appropriately. Choosing or finding the appropriate weights (whatever the procedure for doing so) is akin to teaching the network; the process is called learning. For this, Hopfield turned to a decades-old idea, one we encountered in chapter 1: “Neurons that fire together wire together.” Here, “wiring together” means changing the weights between two neurons such that the activity of the neurons is reinforced. Choosing the weights to accomplish this is called Hebbian learning.

So, if neuron 1 has an output of y1, and neuron 2 has an output of y2, then Hebbian learning says that the weights between those two neurons are given by:

w12 = w21 = y1.y2

Given that we are sticking to symmetric connections, that’s all there is to it, really. Multiply the two outputs, and you get the value for the weights of the symmetric connections between two neurons. If two neurons are each outputting the same value, either +1 and +1 or -1 and -1, then the mutual weights are set to 1. If the two neurons are producing different values (-1 and +1 or +1 and -1), then the mutual weights are set to -1.

Because we want the network, in its stable state, to output “-1, 1, -1” (y1 = -1, y2 = 1, y3 = -1), this gives us the following weights:

w12 = w21 = y1.y2 = (-1)(1) = -1
w13 = w31 = y1.y3 = (-1)(-1) = 1
w23 = w32 = y2.y3 = (1)(-1) = -1

More generally:

wij = yi.yj

Consequently, our weight matrix looks like this:

W =
|  0  -1   1 |
| -1   0  -1 |
|  1  -1   0 |

Notice that it’s symmetrical about the diagonal, whose elements are all zero.

Here’s a very simple method that uses matrices and vectors to generate the weight matrix:

The vector representing the memory we want to store is given by:

y = [y1 y2 y3]

We can get a matrix from our memory vector by multiplying the transpose of the memory vector with itself. This is also called the outer product of vectors. (Note that this is not the dot product, which yields a scalar value.)

The final matrix is almost what we want, except for the diagonal elements, which turn out to be 1, because 1×1 or -1×-1 is each equal to 1. To get to our desired weight matrix, we simply subtract a 3×3 identity matrix from our result.

So, finding the Hebbian weights for any stored pattern, or vector, y simply becomes:

W = yTy - I

where I is the identity matrix of the appropriate size. For example, if the stored pattern has 10 bits, then we require 10 neurons, and both the weight matrix and the identity matrix will be 10×10.
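The rule W = yTy - I is a one-liner in NumPy. Here is a sketch for the three-bit pattern from above (the variable names are our own):

```python
import numpy as np

y = np.array([[-1, 1, -1]])     # the memory, as a 1x3 row vector
W = y.T @ y - np.eye(3)         # outer product minus the identity

# W has zeros on the diagonal (no self-connections), is symmetric,
# and matches the pairwise products w_ij = y_i * y_j computed above.
```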

Once the network’s weights are initialized using this method, the question to answer is this: Why is the pattern stable, or why does the network not change states? By “stable,” we mean a state in which no neuron’s output should ever flip.

Consider the ith neuron, with the output yi.

We know that:

yi = 1 if ∑j wij.yj > 0 (summing over all j ≠ i); otherwise, yi = -1

But we have also set, as per the Hebbian rule:

wij = yi.yj

Therefore:

∑j wij.yj = ∑j (yi.yj).yj = yi ∑j yj2 (summing over all j ≠ i)

yj2 is always 1 (regardless of whether yj is +1 or -1)

So, for a network of n neurons, the weighted sum reduces to (n-1)yi, which always has the same sign as yi. This means the neuron will never flip. None of the neurons in the network will flip. We have proven that if the weights are set according to the Hebbian learning rule (for a given pattern of desired outputs), then that pattern is a stable state for the network.
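This stability can be checked numerically. The sketch below (our own illustration) stores the pattern (-1, 1, -1) with Hebbian weights and verifies that updating every neuron leaves the pattern unchanged:

```python
import numpy as np

def next_outputs(W, state):
    # Each neuron's weighted sum of the other outputs (w_ii = 0),
    # thresholded at zero to give the next bipolar output.
    return np.where(W @ state > 0, 1, -1)

pattern = np.array([-1, 1, -1])
W = np.outer(pattern, pattern) - np.eye(3)   # Hebbian weights

# Updating all neurons reproduces the stored pattern: it is stable.
```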

We have addressed the first three questions asked at the beginning of this section. Now it’s time to tackle the concept of energy in a Hopfield network. We want the stable, stored pattern to represent an energy minimum. This means that any perturbation to the pattern (say, if the output of neuron 1 is forcibly flipped from -1 to 1) should increase the energy of the network, thus causing it to slide back, metaphorically speaking, to its minimum-energy state. The descent back to an energy minimum, if it happens as desired, is the equivalent of recalling a stored memory.

For a more detailed account of why this works, see the mathematical coda on this page. What follows here is an intuitive take. Recall that Hopfield defined the energy of his network as:

E = -1/2 ∑i∑j wij.yi.yj (summing over all i, j with i ≠ j)

It turns out that when the weights of the network have been set using the Hebbian learning rule, then the following are true:

  • In the stable state, which represents a stored memory, the network’s energy (as defined by the equation above) is at a local minimum. The network can have multiple local minima (each potentially representing a different stored memory). In a stable state, neurons don’t flip their outputs any further, and the network remains at that energy minimum.
  • However, if you were to perturb the network, say, by making it store a pattern that’s a slightly corrupted form of a stored memory, this would cause the energy of the network to increase. This perturbed state is unstable, and the neurons will start flipping. It can be shown that when a neuron flips, the overall energy of the network decreases. These dynamics continue until the network reaches a stable state, or a local energy minimum—at which point, the dynamics cease.
  • Once the network reaches an energy minimum, the neurons stop flipping. At this stage, the outputs of the neurons potentially represent some stored memory. Whether or not the stored memory is the one you intended to retrieve depends on the initial perturbation—if the perturbation is too large, it’s possible that the network’s dynamics take it to a different energy minimum than the one corresponding to the stored memory you wanted to retrieve.
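These claims can be checked numerically on a tiny network. The sketch below (our own, with illustrative names) stores one pattern with Hebbian weights and shows that flipping a single neuron's output raises the energy:

```python
import numpy as np

def hopfield_energy(W, state):
    # E = -1/2 * sum over i, j of w_ij * y_i * y_j (diagonal is zero)
    return -0.5 * state @ W @ state

memory = np.array([-1, 1, -1])
W = np.outer(memory, memory) - np.eye(3)   # Hebbian weights

perturbed = memory.copy()
perturbed[0] = 1                           # corrupt one bit

# The stored memory sits at a lower energy than the perturbed state,
# so the dynamics have somewhere "downhill" to go.
```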

Here’s a demonstration of what’s possible with Hopfield networks. Say you want to store a black-and-white 28×28 image of a handwritten digit. That’s 784 pixels to depict a digit. Each pixel can be 0 or 1. For our purposes, because we are using bipolar neurons, we can think of “0” as being equivalent to “-1.” Basically, any image is a vector with 784 elements in it, with each element being either -1 or +1. To store such a vector, we need 784 neurons. Using the Hebbian rule, we can calculate the weights of the 784-neuron network for any image we want to store.

Let’s say y1 represents image 1 (the numeral 5), and y2 represents image 2 (the numeral 8). These digits are modified versions of images from the MNIST database of handwritten digits.

To store the first digit, the weight matrix is calculated as follows:

W1 = y1Ty1 - I

where W1 is a 784×784 matrix, and I is a 784×784 identity matrix. This single operation updates the weights of the network such that image 1 becomes a stored memory. If you were to read out the outputs of the neurons at this stage, you could reconstruct the image, as each neuron outputs the corresponding pixel value.

But what if we wanted to store another image in the same network? If we wanted to store only the second image, we’d set the weights to W2, where:

W2 = y2Ty2 - I

But if you wanted to store both images in the same network, then the composite weight matrix would be:

W = W1 + W2 = (y1Ty1 - I) + (y2Ty2 - I)

This is the same as:

W = y1Ty1 + y2Ty2 - 2I

More generally, if you wanted to store n memories, then:

W = y1Ty1 + y2Ty2 + … + ynTyn - nI

(An aside: Hopfield showed that if you have n neurons, the network can store at most 0.14×n memories. So, a network with 784 neurons can store about 109 memories. If you think of each memory as a vector in n-dimensional space, then these 109 vectors will need to be almost mutually orthogonal, otherwise they’ll interfere with one another. There have been considerable advances in the past few years to increase the storage capacity, leading to networks called modern Hopfield networks.)

Let’s say we stored the two images as memories in our 784-neuron Hopfield network. Now we want to retrieve a memory given some fragment of it. Let’s take the digit 8 and randomly change some pixels.

We feed this image to our network. “Feeding the image” means setting each neuron’s output to either +1 or -1, depending on the values of the corresponding pixels in the perturbed image. Our algorithm for retrieving an image goes like this:

  • Step 1. Calculate the energy of the perturbed network.
  • Step 2. Pick a neuron at random from 1 to 784.
  • Step 3. Calculate its output based on the outputs of all other neurons and the weight matrix.
  • 步骤 4.确定神经元是否应该翻转。如有必要,请翻转。
  • Step 4. Figure out whether the neuron should flip or not. Flip it if necessary.
  • 步骤5.计算新能量。
    • 步骤 5a。如果 (旧能量 - 新能量) <= e,其中e是一个非常小的值,则终止该过程。这基本上意味着神经元翻转后的能量变化非常小,因此我们可能接近某个局部最小值。
    • 步骤 5b.如果 (旧能量 - 新能量) > e,则转到步骤 1(本质上,随机迭代所有神经元,一遍又一遍,直到达到能量最小值)。
  • Step 5. Calculate the new energy.
    • Step 5a. If (old energy – new energy) <= e, where e is some really small value, then terminate the process. This basically means that the change in energy after the neuron has flipped is extremely small, so we are likely near some local minimum.
    • Step 5b. If (old energy – new energy) > e, then go to step 1 (essentially, iterate over all the neurons at random, over and over, until you reach an energy minimum).
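The retrieval loop above can be sketched in Python as follows (a simplified version of mine, not the book's: instead of tracking an energy threshold e, it stops when a full sweep produces no flips, which for a bipolar network amounts to the same stopping condition):

```python
import numpy as np

def energy(W, y):
    # Hopfield energy: E = -1/2 * y^T W y
    return -0.5 * y @ W @ y

def retrieve(W, y, seed=0):
    """Visit neurons in random order; flip any neuron whose field
    (weighted sum of the other outputs) disagrees with its state.
    Stop when a full sweep produces no flips (a local energy minimum)."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    while True:
        changed = False
        for i in rng.permutation(y.size):
            new = 1 if W[i] @ y >= 0 else -1
            if new != y[i]:
                y[i] = new
                changed = True
        if not changed:
            return y

# Store one 8-"pixel" bipolar pattern, corrupt one pixel, then retrieve it
y1 = np.array([1, 1, 1, 1, -1, -1, -1, -1])
W = np.outer(y1, y1).astype(float)
np.fill_diagonal(W, 0)
noisy = y1.copy()
noisy[0] = -1
recovered = retrieve(W, noisy)
print(recovered)
```

For this toy example the corrupted pixel is pulled back to the stored value, and the final state sits at a lower energy than the noisy starting state.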

Using this algorithm, the network, when fed a noisy input image, retrieves the stored image:

We can also create perturbed images by randomly initializing a 28×28 image with +1 and -1 for its pixel values (below, left), feeding it to the network, and seeing what it retrieves (below, right). In this case, the network retrieves the digit 8, but it could also have come back with the digit 5.

Note that if you are given a Hopfield network with some stored memories, all you have access to are the weights of the network. You really don’t know what stored memories are represented by the weight matrix. So, it’s pretty amazing that when given the perturbed image shown above, our Hopfield network dynamically descends to some energy minimum. If you were to read off the outputs at this stage and convert that into an image, you would retrieve some stored memory.

Occasionally, something strange happens. In the figures below, for example, given a different perturbed image (left), the network retrieves a slightly different image (right).

What happened? It’s clear that the retrieved number is the digit 8, but the image has been inverted: Black pixels have turned white, and vice versa. The values of the pixels are the outputs of the neurons. As it turns out, the energy landscape (or energy as a function of the weights and outputs of the neurons) has two minima for each stored memory. If one set of outputs (hence, pixel values) gives you one minimum, outputs that are flipped (1 becomes -1, or white becomes black and vice versa) give you the other minimum in the energy landscape, but an identical one in terms of the energy. So, depending on your initial starting point, as dictated by the perturbed image, you can land in one or the other minimum. So, in some cases, we end up with a bit-flipped image.
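This two-minima symmetry is easy to check numerically: the energy is quadratic in the outputs, so negating every output leaves it unchanged. A quick sketch (the setup is mine, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
stored = rng.choice([-1, 1], size=n)      # one stored bipolar memory
W = np.outer(stored, stored).astype(float)
np.fill_diagonal(W, 0)

def energy(W, s):
    return -0.5 * s @ W @ s

s = rng.choice([-1, 1], size=n)           # any state, not just the memory
print(energy(W, s), energy(W, -s))        # the two values are identical
```

Because E(s) = E(-s) for every state, each stored memory and its bit-flipped sibling sit at equally deep minima.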

We, however, stored two images in the network, representing the numerals 5 and 8. Sometimes, the randomly perturbed image descends to an energy minimum that represents the digit 5 or its bit-flipped sibling (see panels, above).

It’s also possible that if you took the image of the digit 8, perturbed it, and set the network to represent the perturbed image, it might descend to an energy minimum that resulted in the retrieval of an image of the digit 5. That’s because the perturbation landed the network on a part of the energy landscape that was somehow closer to the minimum representing 5 than to the minimum representing 8.

If you ever do build your own Hopfield network, it’s not at all vain to store your own photograph and see if it can be retrieved (perfectly normal behavior). Here’s one such outcome.

The image on the top left is what was stored, the image on the top right is the same image with a lot of noise added, and the bottom image is what the network retrieved, given the noisy image.

When interviewing John Hopfield, I found it awkward to refer to his eponymous network by name. “It feels weird to be calling them Hopfield networks while I’m talking to you, but you must have experienced this all along,” I said.

“I’ve given up on that,” said Hopfield, smiling.

It might be passé now, but in 1981, when Hopfield finished his work, no one was particularly interested in publishing it. No neurobiologist would have looked at what Hopfield had written and claimed it had anything to do with neurobiology. “There would have been no hope of getting it [into] a refereed publication,” Hopfield told me. As it happened, Hopfield was a member of the National Academy of Sciences, for his earlier work in physics. “If you were an academy member, you could essentially, on your own say-so, publish a few papers a year,” he said. “Nobody would criticize them at all, unless they were immoral.”

But back in 1981–82, the Proceedings of the National Academy of Sciences (PNAS) published almost nothing in mathematics or computer science. And those disciplines were at the heart of Hopfield’s paper. Plus, he had one more problem: PNAS had a five-page limit. Hopfield had to distill his work down to five pages while simultaneously hoping to attract the attention of mathematically minded computer scientists and neurobiologists. The paper was published. Reminiscing about the process in his essay “Now What?,” Hopfield quotes Hemingway:

Concerning the writing of non-fiction, Ernest Hemingway remarked, “If a writer of prose knows enough about what he is writing about he may omit things that he knows and the reader, if the writer is writing truly enough, will have a feeling of those things as strongly as though the writer had stated them.” The PNAS length limitation forced me to be highly selective in what was said—and what was omitted. Had Hemingway been a physicist, he would have recognized the style. In hindsight, the omission of the almost obvious probably increased the impact of the paper. The unstated became an invitation for others to add to the subject, and thus encouraged a community of contributors to work on such network models. Successful science is always a community enterprise.

Hopfield’s 1982 PNAS paper has become a classic in the field. It fostered the understanding that neurobiological systems—our brains included, of course—are dynamical and can be mathematically modeled as such. This was a huge advance. Then there was the question of learning. (After all, this is a book on machine learning.) Hopfield networks are what are called one-shot learners. Given one instance of data, the network can memorize it. But an awful lot of the learning our brains do is incremental: Given enough data, we slowly learn about patterns in them.

Incremental training was a key goal for neural network researchers. We saw in chapter 1 how to incrementally train a single-layer perceptron, or neural network. But such networks had enormous limitations. The goal was to move to multi-layer neural networks, but no one knew yet how to efficiently train them. By 1986, that, too, had changed, irrevocably, with the publication of the first detailed exposition of the backpropagation algorithm. And within a few years of that, another paper, by a mathematician named George Cybenko, further inflamed passions about neural networks: Cybenko showed that a certain kind of multi-layer network, given enough neurons, could approximate any function in terms of transforming an input into a desired output. Before we tackle backpropagation, we’ll jump ahead to one of the classic findings about neural networks, the universal approximation theorem.

MATHEMATICAL CODA

CONVERGENCE PROOF / HOPFIELD NETWORK

Theorem: If a Hopfield network in a stable state is perturbed, then it will dynamically transition through a series of states until it reaches and settles into a stable state that represents an energy minimum.

The proof that follows is inspired by a brilliant exposition of it in Raúl Rojas’s book Neural Networks: A Systematic Introduction. Let’s begin with a network of what are called bipolar neurons, which produce an output of +1 or -1. The neurons are connected to each other with symmetric weights. A neuron’s output does not feed back to itself. The network’s weight matrix is given by an n×n matrix for a network of n neurons:

For a pattern to be stored in the network, we use the Hebbian learning rule.

wij = yi.yj

More generally, if there are n neurons, the stored memory is n bits long and is given by the vector y = [y1 y2 … yn]. The weight matrix can be calculated as follows:

Whether or not a neuron flips depends on the weights and the outputs of all the other neurons to which it’s connected. For neuron i, it depends on:

yiold: the current state of neuron i before it responds to other neurons

yinew: the new state of neuron i after it responds to other neurons

The quantity Σj wij yj is often called the “field” of neuron i (analogous to the magnetic field experienced by a single magnetic moment inside some material). If the field of a neuron has the opposite sign to its current state, the neuron flips; otherwise, it doesn’t.

Hopfield defined the energy of the network in these terms:

E = -½ Σi Σj wij yi yj

Let’s take a network with 3 neurons. So, the weights are w11, w12, w13, w21, w22, w23, w31, w32, and w33. We know that w11, w22, and w33 are zero. Here’s the energy, with all the terms expanded:

E = -½ (w12y1y2 + w13y1y3 + w21y2y1 + w23y2y3 + w31y3y1 + w32y3y2)

This can be reorganized, with a focus on neuron 1:

We know that w12=w21, w13=w31, and so on…So, rewriting again:

This can be generalized:

The equation has two terms, one that is specific to y1 and another that is about all the other neurons except y1. Let’s say that it’s neuron 1 that flips. So, we are concerned with two outputs of the first neuron:

y1old: the current state of neuron 1, before it responds to other neurons

y1new: the new state of neuron 1, after it responds to other neurons

Using the equation specific to neuron 1, we have two energies, one before the neuron flips and one after:

The difference in energy after neuron 1 flips is:

ΔE = Enew - Eold

If you were wondering about the ½ in front of the energy function, this is where it comes in handy: the ½ cancels out the 2 before the summation. (Such are the tricks of mathematicians.)

So, this is the change in energy when some ith neuron (in our case, neuron 1) flips states from +1 to -1 or vice versa. For the sake of generality, we’ll abandon being specific about neuron 1 and just refer to the ith neuron as the one that flips. Recall that Σj wij yj is the field of the ith neuron. It’ll always have the opposite sign to that of yiold: That’s why the neuron flips.

So, if yiold is +1, then yinew is -1, because the field has a negative sign.

ΔE = +2 × NegNumber = NegNumber

If yiold is -1, then yinew is +1, because the field has a positive sign.

ΔE = -2 × PosNumber = NegNumber

Regardless of whether the ith neuron flips from +1 to -1 or from -1 to +1, the change in energy is a negative number, meaning the total energy of the system goes down. “Since there is only a finite set of possible states, the network must eventually reach a state for which the energy cannot be reduced further,” Rojas writes.

So, if a series of neuron flips keeps reducing the energy of the network until it reaches a state where no neuron flips, then that state represents a local energy minimum. It’s a stable state. Once the network settles into that state, it cannot change states any further.
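As a numerical sanity check of this argument (my sketch, not from the book), we can record the energy after every flip during asynchronous updates and confirm that it strictly decreases:

```python
import numpy as np

def energy(W, y):
    # Hopfield energy: E = -1/2 * y^T W y
    return -0.5 * y @ W @ y

rng = np.random.default_rng(1)
n = 32
# Store two random bipolar memories via the Hebbian rule
mems = [rng.choice([-1, 1], size=n) for _ in range(2)]
W = sum(np.outer(m, m) for m in mems).astype(float)
np.fill_diagonal(W, 0)

y = rng.choice([-1, 1], size=n)   # random starting state
energies = [energy(W, y)]
for _ in range(10 * n):           # asynchronous random updates
    i = rng.integers(n)
    field = W[i] @ y
    if np.sign(field) != 0 and np.sign(field) != y[i]:
        y[i] = int(np.sign(field))      # the flip that lowers E
        energies.append(energy(W, y))

print(energies[0], energies[-1])
```

Every recorded flip lowers the energy, and since the state space is finite, the updates must eventually halt at a local minimum, just as the proof says.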

QED

CHAPTER 9

The Man Who Set Back Deep Learning (Not Really)

George Cybenko was surprised by the reception he got. He was only one among the many luminaries teaching summer school on deep learning in 2017, in Bilbao, Spain. By then, deep learning—or the process of training neural networks that have three or more layers (one input layer, one output layer, and one or more so-called hidden layers tucked in between the input and output)—had taken over the world of machine learning. Nearly thirteen hundred people attended the school, and Cybenko taught an hours-long mini-course to about four hundred of them. During breaks, students came up and requested selfies with him. “I felt like a rock star,” Cybenko, a professor of engineering at Dartmouth College, Hanover, New Hampshire, told me.

That feeling was somewhat spoiled by a blog post Cybenko later read about the summer program. The blogger mentioned that another highly regarded AI researcher, Li Deng, one of the pioneers of the deep learning revolution, had reportedly quipped that the very theorem that made Cybenko feel like a rock star at the summer school may have substantially set back the field of deep learning. “So, in some circles, I’m the guy that delayed deep learning by twenty years,” a somewhat amused Cybenko told me. To be held up, even in jest, as the person whose work hampered an entire field for two decades is quite an allegation. “It misunderstood what I did,” Cybenko said.

So, what did Cybenko do that brought him celebrity status among deep learning aficionados on the one hand and, yet, caused some to joke about its deleterious effect on the other? To answer these questions, we must jump ahead somewhat in the time line of research into neural networks.

To recap what we know so far, in the late 1950s and early ’60s, Frank Rosenblatt and Bernard Widrow devised single-layer neural networks and the algorithms to train them, making these networks the focus of machine learning for almost a decade. Then, in 1969, Minsky and Papert published their book, Perceptrons, in which they elegantly proved that single-layer neural networks had limitations, while insinuating (without proof) that multi-layer neural networks would likely be similarly useless, effectively killing that field of research and bringing about the first AI winter.

Yet, not everyone gave up. In 1981–82, John Hopfield figured out Hopfield networks. These networks, however, were one-shot learners. They didn’t need the kind of training that multi-layer neural networks required to learn incrementally from data. By the mid-1970s and early ’80s, a handful of researchers had begun elucidating the fundamental elements of an algorithm that could be used to train multi-layer networks. Then, in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a seminal paper in the journal Nature, showing off the strengths of a training algorithm called backpropagation, thus greasing the wheels of deep learning and setting it in motion. (Though, as we’ll see in the next chapter, they weren’t the first to think of backpropagation; its history goes back to Rosenblatt.)

It’s these wheels on which Cybenko’s work allegedly put the brakes. His landmark paper was published in 1989. Given the chronology of the developments, it’d make sense to understand backpropagation first and then deal with Cybenko’s theorem. But we’ll reverse the order. Coming to grips with Cybenko’s work first will set us up for a better understanding of deep neural networks and backpropagation and will give us a great excuse to delve into some delicious details about functions and why they can be regarded as vectors and how all that gets pulled together in Cybenko’s “proof by contradiction.” His proof showed that a neural network with just one hidden layer, given enough neurons, can approximate any function, meaning it can turn an input into any desired output. Think about it: One hidden layer, with an arbitrarily large number of neurons, can represent any function, no matter how complicated. For example, the function could be one that takes an input and produces a simple tone or a complex speech waveform; or recognizes images; or even generates new images. The theorem is called the universal approximation theorem.

The implication of the backpropagation algorithm, detailed in the 1986 Rumelhart, Hinton, and Williams paper, was that multilayer neural networks could now be trained, while one kept in mind practical concerns such as lack of both computing power and training data. At the time, Cybenko, who had done his Ph.D. in the mathematics of signal processing, was intrigued by the promise of these networks. “We had these negative results by Minsky and Papert, and yet people were doing stuff and getting things done. That motivated me to get to the bottom of it,” he told me. “What can a single-hidden-layer network do?”

Before we get to multi-layer networks, here’s a depiction of a single-layer perceptron, or a single-layer neural network, with no hidden layer:

The input layer simply refers to the inputs to the neural network. They are not artificial neurons in themselves. For example, in the figure above, the input to the neural network is a three-dimensional vector [x1, x2, x3]. This vector, stacked vertically, represents the input layer. There is only one layer of actual artificial neurons in the network, hence the name “single-layer neural network.” And that’s the output layer. The figure shows only one neuron as part of the output layer, but there can be any number of them, stacked vertically. Each neuron in the output layer receives the entire vector as input. Each element of the vector is multiplied by its own weight. Each output neuron computes the weighted sum of its inputs, adds a bias, and then uses a thresholding function to generate its output.

So, for a bipolar neuron, which produces either +1 or -1:

Recall from chapter 1 that we can write the equations for a single neuron as follows:

The perceptron training algorithm we encountered in chapter 1 can be used to train this network. To further recap, training via supervised learning means taking several instances of labeled training data (where each labeled instance is some value for x on the input side and a value for the corresponding output, y) and iterating over these instances to arrive at some near-optimal set of weights and bias, so that the network finds some linearly separating hyperplane. Once we have the weights and bias, then given some new x, it’s easy to estimate the output y.
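As a reminder of how that works, here is a compact Python sketch of the perceptron training rule from chapter 1 (my rendering; the AND-style toy dataset is an illustration, not from the book). The rule nudges the weights and bias whenever a training point is misclassified:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron rule: on each misclassified point,
    move the weights and bias toward the correct label."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else -1
            if pred != yi:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# A linearly separable toy problem: AND-like labels on 2D points
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
preds = [1 if xi @ w + b > 0 else -1 for xi in X]
print(preds)
```

For separable data like this, the loop converges to a separating hyperplane; given a new x, the sign of x·w + b then estimates the output.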

For the example network above, the algorithm must learn the values for this weight matrix and the bias term:

However, the perceptron training algorithm works only for a single-layer network. If you have something like the network shown below, it will fail. (Note that the weights now have subscripts to indicate the pertinent layer.)

The network shown has one hidden layer of neurons—hidden because the layer is not directly exposed on the output side. The outputs of the hidden neurons feed into the neurons of the output layer. In this example, again, we have only one neuron in the output layer, but there can be as many as we want, and if you have more than one output neuron, each output becomes an element of the output vector, y. The important thing to notice here is that there are now two weight matrices, one for the connections between the inputs and the hidden layer and another for the connections between the hidden and output layers. (If there were more hidden layers, the number of matrices would correspondingly increase.) Leaving aside the bias terms—they are always present and must be learned, too—the two weight matrices for the network shown above are:

(A note on notation: The weights have subscripts, where each subscript refers to the layer, and two numbers: The number on the left refers to the neuron in the previous layer, whose output serves as input to the neuron in the current layer; and the number on the right denotes the neuron receiving the input in the current layer (later in this book, the two numbers will appear as superscripts). Also note that this indexing is reversed compared to the notation used for Hopfield networks. We could have chosen the same notation, but you’ll often encounter these differing ways of indexing the weights, biases, and outputs, so this just serves to keep us on our toes. Henceforth, we’ll stick to this notation.)

If a network requires more than one weight matrix (one for the output layer and one for each hidden layer), then it’s called a deep neural network: the greater the number of hidden layers, the deeper the network.
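To make the role of the two weight matrices concrete, here is a hedged sketch of the forward pass through a one-hidden-layer network (the shapes, random initial values, and sigmoid activation are my illustrative choices, not the book's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 3 inputs -> 4 hidden neurons -> 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input-to-hidden weight matrix
b1 = np.zeros(4)               # hidden-layer biases
W2 = rng.normal(size=(4, 1))   # hidden-to-output weight matrix
b2 = np.zeros(1)               # output-layer bias

def forward(x):
    h = sigmoid(x @ W1 + b1)       # hidden-layer outputs
    return sigmoid(h @ W2 + b2)    # output-layer output

x = np.array([0.5, -1.0, 2.0])
y = forward(x)
print(y.shape)
```

Training such a network means finding values for both W1/b1 and W2/b2 at once, which is exactly what the perceptron rule cannot do and backpropagation can.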

The perceptron training algorithm doesn’t work when the network is characterized by more than one weight matrix. By the mid- to late 1980s, researchers were successfully training some deep neural networks thanks to the backpropagation algorithm (which we’ll come to in the next chapter); the algorithm could deal with hidden layers. “But, at the time, there was no understanding. You could train it to do what? What, if any, were the limitations? There was an effective algorithm, but sometimes it worked, sometimes it didn’t,” Cybenko said.

Essentially, a deep neural network, with its multiple weight matrices, transforms an input x into an output y, where both the input and output are vectors. This can be written as:

y = f(x)

A neural network—we’ll drop the word “deep” where it’s clear from the context—then approximates some desired function. So, while training a network means finding the optimal values for the weight matrices, it’s also akin to finding the function that best approximates the correlation between the inputs and the outputs. But what does approximating a function achieve? Well, for one, a function can represent a decision boundary. If a new data point falls to one side or the other of the boundary, it can be classified accordingly. Or take another example: The function can also be used for regression, meaning the function is the curve that best fits the training data; then, given a new data point, the function can be used to predict the output. Or, in the context of ChatGPT and other instances of generative AI, the function could represent an AI’s ability, first, to learn an extremely complicated probability distribution that models the training data and, then, to sample from it, thus enabling the AI to generate new data that are in accordance with the statistics of the training data.

Cybenko wanted to understand the strengths and limitations of neural networks. Can they approximate any function? What if the network doesn’t have the capacity, in terms of number of neurons, to approximate the desired function? What can an idealized neural network, with as many neurons as needed, do or not do?

STACK ’EM UP

There’s an intuitive way to understand how a sufficiently large number of neurons in one hidden layer can approximate any given function. Let’s take a complicated enough function (so that the ghosts of Minsky and Papert can’t accuse us of going easy on neural networks):

As complicated as this looks, it’s still just a one-dimensional function, in that the scalar output y depends only on the scalar input x. A plot of this function looks like this:

There’s an intuition from calculus that can help us understand how a neural network might transform inputs to outputs in a manner that closely matches this function. Let’s say we want to find the area under the portion of the curve that’s shown above. We can do so approximately. It involves drawing a bunch of rectangles of equal width, which we try to fit under the curve. The area under the curve can be estimated by summing up the area of all the rectangles we managed to fit under the curve. The thinner the rectangles, the more of them we can fit and the closer we get to the correct answer. Integral calculus takes this to the limit, letting us calculate the area by allowing the width of the rectangles to tend to zero. The figures below show some examples.

We are not interested in integral calculus or the area under the curve. But the method tells us something about how a single-hidden-layer neural network might get at the problem of approximating any function.

What if we designed individual neural units, where each neural unit is made of two or more neurons that each produced an output that equaled the height of a rectangle of a required size and had the necessary width? Examine the first of the four figures above, which approximates the area under the curve for fifteen rectangles. What if each of those rectangles were the output of a neural unit? Each unit would represent a function that had a certain value, equal to the height of the rectangle (the value along the y-axis), for a small range of input values equaling the width of the rectangle (the values along the x-axis). For all other values along the x-axis, the unit outputs a zero. We could just stack these rectangles next to each other and sum them up, and we’d have an approximation of some function.
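This rectangle-stacking idea can be sketched directly in code (my illustration, in the spirit of the argument rather than a quote from the book): each "neural unit" is a pair of steep sigmoid neurons with opposite signs, which together produce an approximate rectangle, and summing the units approximates the target function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, height, w=1000.0):
    """Approximate a rectangle of the given height on [left, right]
    using two steep sigmoid neurons with opposite signs."""
    return height * (sigmoid(w * (x - left)) - sigmoid(w * (x - right)))

# Approximate f(x) = x**2 on [0, 1] with 50 stacked rectangles
f = lambda x: x**2
edges = np.linspace(0, 1, 51)
x = np.linspace(0.01, 0.99, 200)
approx = sum(bump(x, l, r, f((l + r) / 2))
             for l, r in zip(edges[:-1], edges[1:]))
print(np.max(np.abs(approx - f(x))))
```

Narrower rectangles (more hidden neurons) shrink the error further, which is the intuition behind one hidden layer approximating any reasonable function.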

Let’s build a network that can help us approximate a function in this manner. For a detailed visual analysis of this approach, including some nifty interactive graphics, see the notes for a link to Michael Nielsen’s highly original explanation: “A visual proof that neural nets can compute any function.” Nielsen develops the necessary intuition using neurons that have a step activation function (defined further down). We’ll directly use a type of “nonlinear” neuron, based on the one Cybenko used in his proof.

In the example above, the neuron takes in two inputs, x1 and x2, and produces an output y, which depends on two processing stages (the bias b will always be multiplied by the input 1):

z = w1x1 + w2x2 + b

y = a(z)

If a(z) = z, we have a simple linear neuron

a(z) = z

y = w1x1 + w2x2 + b

In formal lingo, the function a(z) is called an activation function. In the neurons we encountered in chapters 1 and 2, a(z) was a threshold function, or a step activation function. An example of such a function would be:

If z > 0: a(z) = 1

Else: a(z) = 0

Cybenko’s neurons used the sigmoid activation function, a(z) = σ(z), where:

σ(z) = 1 / (1 + e^(-z))

Here’s what the function looks like, for a 1D input x, such that z = wx + b:

Note that the function σ(z), plotted on the y-axis, goes smoothly from almost 0 to almost 1. (This smoothness, rather than the abrupt transition we see in the functions that have a step, is important for training networks with hidden layers; it’s something we’ll look at in detail in the next chapter.) In this depiction, the midpoint of the rise occurs exactly at x=0. But that midpoint, as well as the steepness of the rise from near 0 to near 1, can be controlled by changing the values of w and b.
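To make this concrete, here is a small Python sketch (the particular values of w and b are just illustrative): the midpoint of the rise sits where z = wx + b = 0, that is, at x = -b/w, and a larger w makes the rise around that midpoint steeper.

```python
import math

def sigmoid(z):
    # Smooth activation: rises from near 0 to near 1 as z grows
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    # 1D sigmoidal neuron: y = sigmoid(w*x + b)
    return sigmoid(w * x + b)

w, b = 4.0, -8.0             # illustrative values
midpoint = -b / w            # the rise's midpoint: output is exactly 0.5 here
print(neuron(midpoint, w, b))    # 0.5
print(neuron(2.1, 4.0, -8.0))    # gentle rise: a bit above 0.5
print(neuron(2.1, 40.0, -80.0))  # same midpoint, 10x the weight: nearly 1
```

Scaling w and b together keeps the midpoint fixed while steepening the transition, which is exactly the control we will need to build sharp "rectangles" later.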

In the context of our neuron, z is the weighted sum of the inputs plus the bias term. So, the neuron’s output, y, can be written this way:

z = wᵀx + b

y = σ(z)

By varying w and b, one can vary the value of z, and thus the shape and position of the sigmoid. Here are two different outputs for a one-dimensional input and output, for example:

The first shows a sigmoid shifted to the right of the origin (because of a change in the bias), and the second shows a sigmoid that rises almost vertically (because of an increase in the weight) and is also shifted to the right (because of an accompanying change in the bias). In the second plot, you could just as well have changed the bias such that the steeply rising curve moves to the left.

Using a sigmoidal neuron as the element of the hidden layer, the figure below shows a one-dimensional version of the network Cybenko analyzed (with 1D here meaning that both the input and output vectors have only one element each; there is an arbitrary number of neurons in the hidden layer).

The basic idea is that each hidden neuron is generating some sigmoidal curve, where the steepness of the curve is controlled by the neuron’s weight and where the location at which the curve rises along the x-axis is controlled by the neuron’s bias. The output neuron is simply doing a linear combination of the outputs of the hidden neurons: It multiplies each hidden neuron’s output by some weight (which can be negative, causing the curve to fall rather than rise) and then sums up the weighted outputs of the hidden neurons to produce the final output.

Before we analyze simple examples of such summations, let’s examine the mathematical formalism of the one-hidden-layer network that Cybenko studied. The network has the following characteristics:

An input vector of d dimensions: x

Number of neurons in the hidden layer: n

Weight matrix for the hidden layer: W; this is a d×n matrix

Output: y

Given these parameters, the equation Cybenko was examining was this:

y = Σ (i = 1 to n) αi σ(wiᵀx + bi)

Here, wi is the ith column of W, and bi is the ith hidden neuron’s bias.

The expression inside the parentheses evaluates to the output of the ith hidden neuron before it encounters the activation function. The output is then passed through the sigmoid activation function and then multiplied by the weight αi. All the alphas, i = 1 to n, taken together form the weights of the output layer. So, the final output is a linear summation of the outputs of the n hidden layer neurons. Cybenko wanted to prove that this summation, given enough hidden neurons, could approximate any desired function f(x).
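The summation can be written out directly in Python for a d-dimensional input. In this sketch, the weights, biases, and alphas are arbitrary illustrative numbers, not values from Cybenko's paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer_sum(x, weights, biases, alphas):
    """Cybenko's summation: sum_i alpha_i * sigmoid(w_i . x + b_i).

    x:       input vector of d dimensions
    weights: n weight vectors of d dimensions each (the columns of W)
    biases:  n biases, one per hidden neuron
    alphas:  n output-layer weights
    """
    total = 0.0
    for w_i, b_i, alpha_i in zip(weights, biases, alphas):
        z = sum(w * xj for w, xj in zip(w_i, x)) + b_i  # ith pre-activation
        total += alpha_i * sigmoid(z)                   # weighted hidden output
    return total

# A tiny example: d = 2 inputs, n = 3 hidden neurons (illustrative numbers)
x = [0.5, -1.0]
W = [[1.0, 2.0], [-1.0, 0.5], [3.0, 1.0]]
b = [0.0, 1.0, -0.5]
alpha = [1.0, -2.0, 0.5]
print(hidden_layer_sum(x, W, b, alpha))
```

Everything that follows in this section is about choosing the wi, bi, and αi so that this sum traces out a desired curve.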

To understand what happens in such a network, we can switch back to the simplest case of 1D input and output. Let’s look at what happens if we take any two hidden neurons. Here are two possible outputs:

You can see that the second hidden neuron’s output is shifted to the right along the x-axis. Let’s say that the output neuron does a linear combination of the outputs of the two hidden neurons, multiplying the first output by 1 and the second output by -1 (equivalent to flipping it about the x-axis) and then adding them up. The bold line shows the final output:

We have produced an approximately rectangular output. A similar thing can be done with two other hidden neurons to produce a rectangle that’s shifted farther to the right and is taller and skinnier. The dashed and dotted lines are the outputs of two hidden neurons multiplied by 1.5 and -1.5, respectively; these are the coefficients of the linear summation. The solid gray line is the sum of those outputs.
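The rectangle construction is easy to verify numerically. In the sketch below (the steepness, height, and edge positions are hypothetical choices, not those behind the book's figures), a steep sigmoid rising at x = 1, minus an equally steep one rising at x = 2, yields an output close to the coefficient's value inside the interval and close to zero outside it:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bump(x, height=1.5, left=1.0, right=2.0, steep=50.0):
    # Two hidden neurons with steep sigmoids; the output neuron combines
    # them with coefficients +height and -height, producing a rectangle
    # of the given height between `left` and `right`.
    rise = sigmoid(steep * (x - left))   # first hidden neuron's output
    fall = sigmoid(steep * (x - right))  # second hidden neuron's output
    return height * rise - height * fall

print(bump(1.5))  # inside the rectangle: close to 1.5
print(bump(0.0))  # outside: close to 0
print(bump(3.0))  # outside: close to 0
```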

Here are the two rectangles next to each other:

The output neuron can of course do a linear summation of the outputs of all four neurons at once. We’d get the bold line shown in the top figure on the page opposite, as the final output.

Essentially, we have generated two rectangles of different heights and widths (using bespoke weights and biases for the hidden neurons) and added them using similarly bespoke linear coefficients to produce the final output, which looks like some nonlinear function. (What we just did hews closely to Nielsen’s exposition of how to use neurons with a step activation function to generate and sum such rectangles; we used the sigmoid activation function.)

Here’s an attempt at approximating the function y = x², using 10 sigmoidal neurons:

The black dashed line shows the function to be approximated. The light gray lines, solid and dashed, show the outputs of individual hidden neurons. The output of each hidden neuron is multiplied by an appropriate value (the linear coefficient), either positive or negative. These outputs rise from zero to some positive value (and stay there) for some neurons, shown as gray solid lines, and drop from zero to some negative value (and stay there) for others, shown as gray dashed lines. Also, the rise or fall happens at different points along the x-axis, thanks to the biases of the individual hidden neurons. The result of summing up all those biased and weighted outputs is a linear combination. Here’s what it looks like for our example:

This linear combination of the outputs of 10 sigmoidal neurons, shown as the black solid line, almost approximates the function, but it’s clearly not good enough. Increase the number of neurons from 10 to 20 or 100 (the next two figures, respectively), and the power of this approach becomes obvious. With 100 neurons, it’s visually impossible to tell apart the actual function from its approximation.
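The effect of adding neurons can be checked numerically. The sketch below (with assumed, hand-chosen weights, not the ones behind the book's figures) tiles the interval [0, 1] with n sigmoid-pair rectangles whose heights follow y = x², then measures the worst-case error over the interior of the interval; the error shrinks as n grows:

```python
import math

def sigmoid(z):
    if z < -60.0:  # guard: exp(-z) would overflow for very negative z
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def approx_square(x, n):
    # Tile [0, 1] with n sigmoid-pair rectangles; rectangle i covers
    # [i/n, (i+1)/n] and its height is x^2 at the interval's midpoint.
    steep = 50.0 * n  # keep rectangles sharp relative to their width
    total = 0.0
    for i in range(n):
        left, right = i / n, (i + 1) / n
        height = ((left + right) / 2) ** 2
        total += height * (sigmoid(steep * (x - left)) - sigmoid(steep * (x - right)))
    return total

def max_error(n):
    # Worst-case error over the interior of [0, 1]
    xs = [k / 200 for k in range(10, 191)]
    return max(abs(approx_square(x, n) - x * x) for x in xs)

print(max_error(10))   # coarse fit
print(max_error(100))  # far smaller error
```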

The function approximated above is simple. Below is a more complex function and its approximation with 300 neurons.

The dashed black line is the function we want to approximate. The individual outputs of the sigmoidal neurons, appropriately weighted and biased, are shown in gray. The final output, the linear combination of those outputs, is shown in solid black. It’s astonishing to see how a few hundred neurons are sufficient to get very close to the original function.

A note of caution here: I hand-designed the weights and biases used in these examples to show how these networks work. In practice, a network would have to learn the correct values for these parameters. A training algorithm, such as backpropagation, will find the weights and biases using training data. If there is some unknown complicated function that maps the inputs to the outputs or represents the probability distribution over the data, then training the network is akin to finding the best possible set of weights and biases to approximate that function.

Also, we have dealt with the simple case of a scalar input and a scalar output. Real-world problems require input vectors that can have dimensions of tens of thousands or even millions. But the same ideas hold, regardless of the dimensionality of the input and output vectors.

Despite all this analysis, we have only developed an intuition for why a single-hidden-layer neural network, with enough neurons, can approximate any function. It’s not a proof. That requires some nifty mathematics.

Cybenko had the necessary mathematical chops. In particular, he was an expert in functional analysis, or the analysis of operations on vectors and on functions. (Functions, as we’ll see in some detail, are vectors in an infinite-dimensional space.) In 1988, Cybenko worked on the problem for the better part of the year and wrote a small technical report showing that a network with two hidden layers can approximate any function. The proof was mathematically rigorous. Cybenko wanted to go further. “I had this feeling it should be able to do it with one hidden layer,” he said. He was right.

Cybenko’s proof itself is too complex for us, and it relies on other complex theorems. We’ll have to settle for a bird’s-eye view of his work. But first, a small segue to talk about functions as vectors.

FUNCTIONS AS VECTORS

Of all the concepts in this book, the idea of a function as a vector may cause the most head-scratching, but it’s also among the most beautiful and powerful ideas we’ll encounter. Take the function y = sin(x), for example. What follows is a plot of the function, for x ranging from 0 to 10 radians. How does one think of this function as a vector?

Well, let’s just consider the x values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. At each of these locations on the x-axis, the function has a corresponding value on the y-axis. These values can be written down as: [0.0, 0.84, 0.91, 0.14, -0.76, -0.96, -0.28, 0.66, 0.99, 0.41, -0.54].

We’ve just approximated the function using a sequence of 11 numbers. This sequence is a vector in 11-dimensional space.
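The sampling step is easy to reproduce in Python with the standard library:

```python
import math

# Sample y = sin(x) at the 11 integer points x = 0, 1, ..., 10,
# rounding to two decimal places as in the text.
xs = list(range(11))
sin_vector = [round(math.sin(x), 2) for x in xs]
print(sin_vector)
# [0.0, 0.84, 0.91, 0.14, -0.76, -0.96, -0.28, 0.66, 0.99, 0.41, -0.54]
```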

Let’s do the same for y = cos(x); the figure is shown below.

This function, for the same set of coordinates on the x-axis, can be written down as: [1.0, 0.54, -0.42, -0.99, -0.65, 0.28, 0.96, 0.75, -0.15, -0.91, -0.84]. This is a different vector in the same 11-dimensional space. Just imagine 11 axes that are orthogonal to each other in 11D space. (Well, we cannot actually visualize it, but these mathematical spaces exist.) The functions sin(x) and cos(x), evaluated for 11 values of x between 0 and 10 (inclusive), turn into vectors in this 11D space.

We can do this for any function. First map the function on the xy plane, for a certain range of values on the x-axis, and then determine the value of the function for a preselected array of x-axis values. This output array can be thought of as a vector whose dimensionality is determined by the number of points at which you chose to evaluate the function. In our example, it’s 11 locations along the x-axis, ranging from 0 to 10, inclusive.

Now comes another conceptual leap, toward infinity. We’ve considered only 11 points on the x-axis, between the end points 0 and 10, inclusive. But the line segment between 0 and 10 is infinitely divisible: There is an infinite number of points in that interval. So, technically, we can have an infinite sequence of real numbers that represent each function between 0 and 10. We mere mortals have problems visualizing anything beyond three dimensions, let alone 11D space. Mathematicians, however, revel in going beyond. There are spaces with infinite dimensions, or with an infinite number of axes. Any function, then, can be thought of as a point in this infinite-dimensional space.

The march toward infinity doesn’t stop here. What if, instead of evaluating the function between 0 and 10, as we did in our example, we let the x-axis extend to minus infinity on one side and plus infinity on the other? The function, evaluated at an infinite number of points, along an axis that is itself infinite in length, results in a point in yet another infinite-dimensional space.

Let’s add one more complication. The functions we have looked at so far have been one-dimensional: They take in a scalar input and spit out a scalar output. But functions can take in vectors as inputs and produce vectors as outputs. We don’t have to fret about this, but know that that’s the most general way to think about what a neural network is doing: transforming one vector to another vector. Take a neural network with one hidden layer. The input column vector, x, is multiplied by a matrix of weights of the hidden layer, producing another column vector, each element of which is passed through the sigmoid function, which results in another column vector. This column vector—the output of the hidden layer—is then multiplied by the weights of the output layer, producing yet another vector: the output vector, y.
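Here is a minimal sketch of that vector-to-vector view (the weights are arbitrary illustrative numbers, and the biases are omitted for brevity):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    # Multiply matrix M (a list of rows) by column vector v
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(x, W_hidden, W_output):
    # One-hidden-layer network as a vector-to-vector map:
    # hidden = sigmoid(W_hidden @ x), then y = W_output @ hidden
    z = matvec(W_hidden, x)
    hidden = [sigmoid(zi) for zi in z]
    return matvec(W_output, hidden)

# 3D input -> 4 hidden neurons -> 2D output (illustrative weights)
x = [1.0, -2.0, 0.5]
W_hidden = [[0.2, 0.4, -0.1], [0.5, -0.3, 0.8], [-0.6, 0.1, 0.2], [0.3, 0.3, 0.3]]
W_output = [[1.0, -1.0, 0.5, 0.0], [0.2, 0.2, 0.2, 0.2]]
y = forward(x, W_hidden, W_output)
print(y)  # a 2-element output vector
```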

Cybenko thought more generally about his one-hidden-layer network. Each hidden neuron implements some sigmoidal function. We’ve just learned that each such function is itself a vector in some infinite-dimensional space. An output neuron implements a linear combination of the vectors of functions implemented by neurons of the hidden layer. The question Cybenko asked was this: If you performed every possible linear combination of this arbitrarily large number of sigmoid functions (or, rather, their associated vectors), could you get to every possible function (or vector) in the vector space of functions?

“Vector space” is a technical term, referring to objects such as vectors, matrices, and functions that live in that space. For example, 2D vectors live in the xy plane; 3D vectors live in the xyz coordinate space. These objects must satisfy certain properties in order for the space to be called a vector space. But we don’t need to know those details to get a sense of Cybenko’s approach.

Cybenko’s proof by contradiction starts with the assumption that a neural network with one arbitrarily large hidden layer cannot reach all points in the vector space of functions, meaning it cannot approximate all functions. He then shows that the assumption leads to a contradiction and, hence, is wrong. It wasn’t a proof by construction, in that Cybenko did not explicitly build a network that achieves the approximation. Rather, it was classic reductio ad absurdum. He started by assuming that some proposition was true and ended up showing that the proposition was false. “I ended up with a contradiction,” Cybenko said. “The proof was not constructive. It was an existence [proof].”

Because his proof that neural networks could indeed approximate any function given enough hidden neurons focused on networks with just one hidden layer, it apparently caused some researchers to spend their time building networks with just one hidden layer, rather than go deep by increasing the number of hidden layers. “I didn’t say you should use one layer,” Cybenko said. “People concluded [that you] only need one layer.”

The revolution in deep learning that began around 2010 happened because researchers began to take seriously the “deep” in “deep learning”—and started to increase the number of hidden layers to well beyond one. But it took almost two decades after Cybenko’s proof for this revolution to take off—and to be fair to Cybenko, the revolution needed other ingredients that weren’t available in the 1990s: massive amounts of training data and computing power.

Still, the proof was a big deal. In the concluding paragraph of his 1989 paper, Cybenko speculated that while the approximating properties of neural networks were extremely powerful, it was unclear just how many neurons would be required to approximate any function with enough accuracy. “We suspect quite strongly that the overwhelming majority of approximation problems will require astronomical numbers of terms,” Cybenko wrote. “This feeling is based on the curse of dimensionality that plagues multidimensional approximation theory and statistics.”

But the deep neural networks that are dominating today’s efforts in AI—with billions, even hundreds of billions of neurons and tens, even hundreds of hidden layers—are challenging the theoretical foundations of machine learning. For one, these networks aren’t as susceptible to the curse of dimensionality as was expected, for reasons that aren’t entirely clear. Also, the massive numbers of neurons and, hence, parameters should overfit the data, but these networks flout such rules, too. However, before we can appreciate such mysteries, we need to examine the algorithm that allowed researchers to start training deep neural networks in the first place: backpropagation.

CHAPTER 10

The Algorithm that Put Paid to a Persistent Myth

It’s AI folklore that Minsky and Papert killed research on neural networks, starting in the late 1960s, by proving that single-layer perceptrons could not solve something as simple as the XOR problem. I brought up the Minsky-Papert proof early on in my conversation with Geoffrey Hinton, one of the key figures behind the modern deep learning revolution. Hinton got interested in neural networks in the mid-1960s, when he was still in high school in the United Kingdom.

“This was before the Minsky and Papert proof about [neural networks] not working for XOR,” I said.

“Yes,” Hinton said, but then immediately lodged a protest. “I can’t let it go [about] it not working for XOR. I reserve the right to complain about that.” And he did. (More of that later.)

In high school, Hinton was influenced by a mathematician friend who was wondering how memories are stored in the brain. It was about then that scientists figured out how to make 3D holograms. “He got interested in the idea that the brain might be like a hologram in that memory is not localized,” Hinton said. While his friend was exploring how memories are stored, Hinton became interested in how brains learn; he wanted to understand the mind. This pursuit led him to study physics and physiology at university, but all they taught about the brain was how action potentials, or electrical signals, travel along axons of neurons. Not exactly an elucidation of how the brain works. A disappointed Hinton turned to philosophy. “I thought philosophers had something to say about it. And then I realized they didn’t,” he told me. “They didn’t have the Feynman concept of understanding the mind—that you need to figure out how to build one to understand it.”

A frustrated Hinton even tried studying experimental psychology—again, to no avail. “What they did was design experiments to distinguish between two hypotheses, both of which were obviously hopeless,” he said. “That didn’t satisfy me.”

Disenchanted, he dabbled in some carpentry and took some time off to read. He was deeply influenced by Donald Hebb’s book The Organization of Behavior. In 1972, Hinton joined the school of artificial intelligence at the University of Edinburgh, to work for his doctorate with Christopher Longuet-Higgins, a theoretical chemist who had moved from Cambridge to Edinburgh and would later co-found the Department of Machine Intelligence and Perception (an outcome of the United Kingdom Science Research Council’s decision to fund a center to further the emerging field of AI).

Hinton recalled Longuet-Higgins as having been interested in holograms and memories and making holographic memories using neural networks. But by the time Hinton got to Edinburgh, Longuet-Higgins had switched sides, going from believing in neural networks and connectionism to thinking that symbolic AI was the answer. He had also taken on one other student and had gotten him to work on symbolic AI. “He spent the whole time trying to get me to switch,” Hinton said of Longuet-Higgins.

But Hinton wasn’t convinced about using symbolic AI and logic to achieve artificial intelligence. “I never really liked logic,” he told me. “I never believed people were logical.” But people are intelligent, so intelligence must not be simply the outcome of applying rules of logic, which is what symbolic AI does—use rules of logic to manipulate symbols to arrive at answers. Hinton wanted to work on neural networks. He negotiated an arrangement with Longuet-Higgins to let him do so for six months, determining that if he hadn’t produced anything good by then, he would switch. “And after six months, I said, ‘Yeah, well, I haven’t produced anything good. But I think I’m going to be able to, so I need another six months,’ ” Hinton said. “It kept going like that.”

Hinton did finish his Ph.D. His work involved solving constrained optimization problems using neural networks. “But they weren’t learning,” he said of his neural networks. He was convinced, however, that multi-layer neural networks could one day be made to learn. This was the mid-1970s. By then, Minsky and Papert had proven that single-layer perceptrons could not solve the XOR problem. Hinton acknowledges that their proof was substantial, in that it was general, and that the XOR problem was a specific instance of a class of problems that single-layer perceptrons couldn’t solve. Still, Hinton wasn’t impressed. “It basically was a con job in that they proved that a simple kind of net couldn’t do things. And they had no proof that a more complicated net couldn’t do them. It was just kind of by analogy: ‘Since we proved the simple nets can’t do it, forget it,’ ” he told me. “And people fell for it.”

Hinton didn’t. Neither had Rosenblatt. Recall our encounter with Rosenblatt’s student George Nagy in chapter 1: Nagy told us that Rosenblatt was well aware of the problem of training multi-layer perceptrons; Rosenblatt addressed it in his 1961 tome, Principles of Neurodynamics, which Hinton had devoured.

In chapter 13 of that book, a section heading reads, “Back-Propagating Error Correction Procedures.” In that section, Rosenblatt clearly lays out the problem for a three-layer perceptron, which has a sensory layer (S), which receives inputs; a response layer (R), which produces outputs; and a layer (A) in between the two, creating a network that goes from S to A to R (S → A → R). Rosenblatt writes, “The procedure to be described here is called the ‘back-propagating error correction procedure’ since it takes its cue from the error of the R-units, propagating corrections back towards the sensory end of the network if it fails to make a satisfactory correction quickly at the response end.”

The basic idea is rather straightforward (in hindsight, of course): Determine the error made by the network by comparing the produced output with the expected output and then figure out how to change the weights of the network based on the error such that the network produces the correct output. The perceptron training algorithm we encountered in chapters 1 and 2 can adjust the weights only of the final A → R layer. Rosenblatt’s three-layer network had a predetermined set of weights for the S → A layer. But he understood that those weights, too, should be amenable to training. “It would seem that considerable improvement in performance might be obtained if the values of the S to A connections could somehow be optimized by a learning process, rather than accepting the arbitrary or pre-designed network with which the perceptron starts out,” he writes.

His book lays out one such procedure, to backpropagate errors, if modifying the weights of the A → R layer isn’t enough. But it didn’t work well. Even so, Rosenblatt had introduced the idea of backpropagation as a means for training multi-layer networks, but hadn’t figured out exactly how to do it. (He built the Mark I Perceptron we saw in chapter 1 using such layers, sans backpropagation.)

He had also identified another problem with training neural networks. Consider this three-layer neural network:

For now, let’s set aside the problem of training such a network. Let’s assume that before training begins, all the weights are initialized to zero. (We are ignoring the bias term associated with each neuron here, for simplicity.) That means that for any given input x = [x1, x2, x3], each hidden neuron is going to produce the same output. The neuron in the final layer produces some output. We calculate the error and update each weight such that the error is reduced a little. Because the initial weights were all the same, the change to each weight will be identical, making them equal to each other even after the update. This continues until the network converges on a set of weights that work well for the training dataset. Unfortunately, the weights for each hidden neuron will have the same set of values, so each hidden neuron will process the input data in exactly the same manner as every other hidden neuron; in essence, all neurons have learned the same thing. To put it differently, they have all picked up on the same feature in the data. We could just as well have used only one hidden neuron because of the symmetry in the weights of the neurons in the hidden layer.
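The symmetry problem is easy to see numerically. Below is a minimal sketch (not from the book): a 3-input, 2-hidden-neuron, 1-output linear network whose weights all start at the same constant. Both hidden neurons compute the same value and receive identical gradient updates, so their weight rows stay interchangeable forever; all-zero initialization is the degenerate case of the same effect. The architecture, data, and learning rate are illustrative choices.

```python
import numpy as np

# Tiny 3-2-1 linear network with symmetric initial weights.
# Identical neurons get identical gradients, so they never diverge.

def train_step(W1, w2, x, target, lr=0.1):
    h = W1 @ x                        # hidden activations (2 values)
    y = w2 @ h                        # scalar output
    err = y - target
    grad_w2 = err * h                 # gradient of (y - target)^2 / 2
    grad_W1 = np.outer(err * w2, x)
    return W1 - lr * grad_W1, w2 - lr * grad_w2

W1 = np.full((2, 3), 0.5)             # hidden weights: 2 neurons x 3 inputs
w2 = np.full(2, 0.5)                  # output weights
x = np.array([1.0, 2.0, 3.0])

for _ in range(100):
    W1, w2 = train_step(W1, w2, x, target=1.0)

# The weights have changed, but the two hidden rows are still identical:
print(np.allclose(W1[0], W1[1]))      # True
```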

Rosenblatt 认识到了神经网络中的对称性问题。他用一个实例证明:如果一个三层神经网络从对称权重开始,并使用确定性程序来更新权重,那么它就无法解决一个简单的问题。“虽然该定理表明确定性程序不能保证有效,但非确定性程序是否有效仍有待证明。在最极端的情况下,只要错误继续发生,我们就可以采用一种随机改变每个连接的值的程序,每个连接独立于其他连接,”他写道。

Rosenblatt recognized this problem of symmetry in neural networks. He proved, using an illustrative example, that a three-layer neural network could not solve a simple problem if it started off with symmetric weights and used a deterministic procedure to update the weights. “While this theorem shows that a deterministic procedure cannot be guaranteed to work, it remains to be shown that a non-deterministic procedure will work. In the most extreme case, we could employ a procedure which randomly varies the value of every connection, independently of the others, as long as errors continue to occur,” he writes.

罗森布拉特建议采用随机过程来更新权重。辛顿认为这意味着神经元的输出应该是随机的,这意味着在神经元的输出中引入随机性元素。这将确保在每次训练数据传递过程中权重的更新方式不同,并且最终训练好的网络具有必要的不对称性,从而使神经元能够检测数据中的不同特征。

Rosenblatt was suggesting a stochastic process for updating the weights. Hinton read this to mean that the output of the neurons should be stochastic, which meant introducing an element of randomness to a neuron’s output. This would ensure that the weights were updated differently during each pass of the training data and that the final trained network would have the requisite asymmetry, enabling the neurons to detect different features in the data.

“我被他的论点说服了。神经元必须是随机的,”Hinton 说。因此,他继续考虑用随机神经元来训练多层神经网络。但这种打破对称性的方法行不通。“这让我慢了一段时间。”

“I was convinced by his argument. The neurons had to be stochastic,” Hinton said. So, he continued thinking about training multi-layer neural networks with stochastic neurons in mind. But this method of breaking symmetry didn’t work. “This slowed me down for a while.”

还有另一种更优雅的打破对称性的方法,罗森布拉特关于非确定性程序的断言中也暗示了这一点,但直到辛顿开始与加州大学圣地亚哥分校的心理学家戴维·鲁梅尔哈特合作时,他才明白这一点。鲁梅尔哈特会指出这个更简单的解决方案。在计算机科学家罗纳德·威廉姆斯的帮助下,他们的共同努力将产生现代版本的反向传播算法。但我们说得太超前了。辛顿从爱丁堡到圣地亚哥与鲁梅尔哈特合作的道路并不平坦。

There’s another, far more elegant way to break symmetry that’s also hinted at in Rosenblatt’s assertion about nondeterministic procedures, but it wouldn’t become clear to Hinton until he began working with psychologist David Rumelhart at UC San Diego. Rumelhart would point out the simpler solution. Their combined effort, with help from computer scientist Ronald Williams, would lead to the modern version of the backpropagation algorithm. But we are jumping ahead. Hinton’s path from Edinburgh to San Diego, to work with Rumelhart, wasn’t straightforward.

1977 年,Hinton 提交了他的博士论文。尽管他坚信神经网络,但他在英国却得不到任何支持。“我对似乎没有人相信神经网络这一事实感到厌烦,于是我放弃了学术界,去一所免费学校教书。”那是伦敦伊斯灵顿的白狮街免费学校。辛顿教市区贫困家庭的孩子们数学基础知识。学校甚至没有纸,所以他用论文初稿的背面来教孩子们。在学校待了大约六个月后,在等待论文最终答辩期间,他开始考虑重返学术界。他在英国几乎得不到面试机会;只有萨塞克斯大学为发展心理学系的一个职位面试了他,但拒绝了他。萨塞克斯的一位学者建议辛顿把论文缩印复印,寄给他在美国听说过的每一个人。“因为人工智能就在那里,”辛顿说。

Hinton handed in his Ph.D. thesis in 1977. Despite his unwavering belief in neural networks, he found no support in the United Kingdom. “I was so fed up with the fact that nobody else seemed to believe in neural nets that I abandoned academia and went and taught in a free school.” It was the White Lion Street Free School in Islington, London. Hinton taught inner-city kids the basics of mathematics. The school didn’t even have paper, so he used the backsides of the pages from early drafts of his thesis to teach the kids. After about six months at the school, while waiting for the final exam for his thesis, he began thinking of getting back into academia. He could barely get an interview in Britain; only the University of Sussex interviewed him for a position, in the Department of Developmental Psychology, and rejected him. An academic at Sussex suggested that Hinton make reduction photocopies of his thesis and mail them to everyone he had heard of in the United States. “Because that’s where AI was,” Hinton said.

鲁梅尔哈特阅读了辛顿的论文,并向他提供了加州大学圣地亚哥分校的博士后职位。对于辛顿来说,美国是继英国学术“单一文化”之后的启示,在英国,只有正确的做事方式,其他一切都被视为异端。神经网络构成了异端。“而美国比那更大。特别是,它有两个海岸。它们彼此之间可能都是异端,”辛顿说。

Rumelhart read Hinton’s thesis and offered him a postdoctoral position at UC San Diego. For Hinton, the United States was a revelation after the academic “monoculture” of Britain, where there was the right way to do things and where everything else was considered heresy. Neural networks constituted heresy. “And the U.S. is bigger than that. In particular, it’s got two coasts. They can each be heresy to the other,” Hinton said.

鲁梅尔哈特对神经网络非常感兴趣。对于辛顿来说,这里的氛围非常棒。“我从未去过一个不认为神经网络是无稽之谈的地方。”当时是 20 世纪 80 年代初。对神经网络感兴趣的人专注于训练至少有一个隐藏层的多层网络。到现在,所谓的反向传播算法的轮廓已经清晰起来。

Rumelhart was keenly interested in neural networks. For Hinton, the atmosphere was wonderful. “I’d never been anywhere where neural nets weren’t considered to be nonsense.” This was now the early 1980s. Those interested in neural networks were preoccupied with training multi-layer networks with at least one hidden layer. By now, the contours of what came to be called the backpropagation algorithm were clear.

我们将讨论精确的数学细节,但从概念上讲,这是算法的最终目标。想象一个具有一个隐藏层的三层神经网络。您给它一个输入,它就会产生一个输出。您计算网络产生的误差,即其输出与预期正确值之间的差异。此误差是网络所有权重的函数。如何最小化误差?好吧,您可以使用梯度下降,也就是我们在第 3 章中通过 Bernard Widrow 的简单 ADALINE 网络遇到的技术。找到误差(作为权重的函数)的梯度,并通过将每个权重更新一点点,向梯度的相反方向迈出一小步。

We’ll come to the exact mathematical details, but conceptually, here’s the algorithm’s end game. Think of a three-layer neural network with one hidden layer. You feed it an input, and it produces an output. You calculate the error made by the network, which is the discrepancy between its output and the expected correct value. This error is a function of all the weights of the network. How do you minimize the error? Well, you can do gradient descent—the technique we encountered in chapter 3, with Bernard Widrow’s simple ADALINE network. Find the gradient of the error (as a function of the weights) and take a small step in the opposite direction by updating each weight by a tiny amount.

听起来很简单。你可能会问,这样做有什么问题?首先,你下降的函数形状不一定是凸的。在 Widrow-Hoff 算法中,函数是碗状的,梯度下降保证让你到达碗底,即全局最小值,这代表网络可能产生的最小误差,因此也是其权重的最优值。但事实证明,对于具有隐藏层的神经网络,误差函数不是凸的;它有许多山丘和山谷。网络可能会卡在其中一个山谷,即局部最小值,即使在误差较低的其他山谷或最小值中也存在。

Sounds easy. Where’s the problem with doing just that? you might ask. For starters, the shape of the function you’re descending is not necessarily convex. In the Widrow-Hoff algorithm, the function is bowl-shaped, and gradient descent is guaranteed to get you to the bottom of the bowl, the global minimum, which represents the lowest possible error that the network can make and, hence, the optimal value for its weights. But it turns out that the error function in the case of a neural network with hidden layers is not convex; it has many hills and valleys. And it’s possible that the network can get stuck in one of the valleys, a local minimum, even though other valleys, or minima, exist where the error is lower.

事实上,明斯基本人在背弃神经网络之前,就已经研究过这一过程的特性。他和另一位人工智能先驱奥利弗·塞尔弗里奇在 1961 年合著的论文《随机网络中的学习》中,写到了一种叫做爬山法的算法。它类似于梯度下降,只不过你要寻找的是一个函数的峰值,其中函数代表性能:函数的值越大,机器执行任务的表现就越好。“让机器对一个或几个参数、控件或变量进行一些小的更改。如果性能有所改善,则重复该过程;如果没有改善,则返回到上一个状态并做出不同的小改变。从长远来看[,]性能必须提高到局部最优,此时控制中的任何小变化都不会带来改进。这种技术通常被称为‘爬山法’,”作者写道。就像在具有多个局部最小值的非凸函数上进行梯度下降会产生问题一样,爬山法也会遇到明斯基和塞尔弗里奇所说的“台地现象”:“这个空间显然由大量平坦区域组成。这些平坦的高地可以被视为‘高台’或‘台地(mesa)’。”对参数值的微小调整不会带来机器性能的任何改善,这意味着机器卡在了台地上;或者它们会导致性能的巨大变化,这类似于机器从台地上跌落到下坡上。明斯基实际上排除了爬山法作为一种可行方法的可能性。

In fact, Minsky himself, before he turned on neural networks, had studied the properties of this process. He and another AI pioneer, Oliver Selfridge, in a paper called “Learning in Random Nets,” which they co-authored in 1961, wrote about an algorithm called hill climbing, which is analogous to gradient descent, in that you are looking to find the peak of a function, where the function represents performance: The greater the value of the function, the better the machine is at its task. “Let the machine make some small changes in one or a few of its parameters or controls or variables. If the performance improves, repeat the process: if not, return to the previous state and make a different small change. In the long run[,] performance must improve to a local optimum where no small change in controls yields improvement. This technique is commonly referred to as ‘hill-climbing,’ ” the authors wrote. Just as doing gradient descent on a non-convex function with multiple local minima creates problems, hill climbing can encounter what Minsky and Selfridge called the mesa phenomenon: “The space apparently is composed of large numbers of flat regions. The flat elevated regions might be thought of as ‘table-lands’ or ‘mesas.’ ” Small tweaks to the values of parameters do not result in any improvement in the machine’s performance, which implies that the machine is stuck on the mesa; or they lead to large changes in performance, which is akin to the machine’s falling off the mesa onto a downslope. Minsky effectively ruled out hill climbing as a viable method.
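The procedure Minsky and Selfridge describe is simple enough to sketch in a few lines. The code below is an illustrative reconstruction, not from their paper: perturb one parameter at a time, keep the change only when performance improves, and revert otherwise. The function being maximized is a made-up example with a single peak.

```python
import random

# A minimal hill-climbing sketch: make a small change to one
# parameter; keep it if performance improves, revert if not.

def hill_climb(performance, params, step=0.1, iters=1000, seed=0):
    rng = random.Random(seed)
    best = performance(params)
    for _ in range(iters):
        i = rng.randrange(len(params))
        delta = rng.choice([-step, step])
        params[i] += delta
        score = performance(params)
        if score > best:
            best = score          # improvement: keep the change
        else:
            params[i] -= delta    # no improvement: revert
    return params, best

# Maximize -(x - 3)^2 - (y + 1)^2, whose peak is at (3, -1)
f = lambda p: -(p[0] - 3) ** 2 - (p[1] + 1) ** 2
params, best = hill_climb(f, [0.0, 0.0])
print([round(v, 1) for v in params])  # converges near [3.0, -1.0]
```

On a smooth single-peaked function this works; on a "mesa," almost every small change leaves the score unchanged, so nothing is ever kept, which is exactly the failure mode the paper describes.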

这或许可以解释明斯基和帕普特对多层神经网络的悲观看法。这是比较宽容的解释。不太宽容但可能更准确的解释是,两人故意破坏神经网络研究,以便资金流向他们青睐的人工智能形式——符号人工智能。“明斯基和帕普特一心想消灭所有竞争……他们的书暗示的内容比实际展示的内容要多得多。他们着手分析单层感知器的容量,而完全忽略了罗森布拉特在书中数学部分关于多层机器的章节以及他对基于误差反向传播的概率学习算法收敛性的证明,”加州大学伯克利分校的哲学教授 Hubert L. Dreyfus 和他的兄弟、工业工程和运筹学教授 Stuart E. Dreyfus 写道。

This may explain Minsky and Papert’s dismal view of multi-layer neural networks. That’s the more charitable interpretation. The less charitable, and maybe more accurate, one is that the duo was deliberately sabotaging research into neural networks so that funding could flow to their favored form of artificial intelligence, symbolic AI. “Minsky and Papert were so intent on eliminating all competition…that their book suggests much more than it actually demonstrates. They set out to analyze the capacity of a one-layer perceptron, while completely ignoring in the mathematical portion of their book Rosenblatt’s chapters on multi-layer machines and his proof of convergence of a probabilistic learning algorithm based on back propagation of errors,” write professor of philosophy Hubert L. Dreyfus and his brother, Stuart E. Dreyfus, professor of industrial engineering and operations research, both at the University of California, Berkeley.

但好主意永远不会消失。1960-61 年,控制和航空工程师 Henry J. Kelley 和 Arthur E. Bryson独立地找到了一种计算方法,例如,计算火箭的最佳轨迹,这种方法本质上具有反向传播算法所需的思想。1962 年,Stuart Dreyfus 基于微积分中的链式法则(我们稍后会讲到)导出公式,以增强 Kelley-Bryson 方法的实用性。在一篇详尽介绍该算法历史的博客文章中,瑞士卢加诺-维加内洛人工智能实验室 IDSIA 的人工智能先驱 Jürgen Schmidhuber 提到了许多有类似想法的人。例如,1967 年,Shun'ichi Amari 展示了使用随机梯度下降训练具有隐藏单元的多层感知器的技术;而 Seppo Linnainmaa 在其 1970 年的硕士论文中开发了高效反向传播的代码。1974 年,Paul Werbos 在哈佛大学提交了他的博士论文。这篇论文的标题是《超越回归:行为科学预测和分析的新工具》,最接近于现代版本的反向传播算法。这篇论文并没有得到太多的宣传;它也不是针对神经网络研究人员的。尽管取得了这些进展,但就机器学习而言,它们都没有留下任何痕迹。那是在 20 世纪 80 年代初,当时 Rumelhart、Hinton 和 Williams 开发了他们的算法,正是这个版本为深度神经网络提供了所需的推动力。要欣赏这些发展,我们需要穿上涉水靴,踏入微积分、导数和链式法则的缓流之中。

But good ideas never really go away. In 1960–61, control and aeronautics engineers Henry J. Kelley and Arthur E. Bryson independently arrived at a method for calculating, for example, the optimal trajectory for a rocket, a method that had, in its essence, the ideas necessary for the backpropagation algorithm. In 1962, Stuart Dreyfus derived formulas based on the chain rule in calculus (we’ll come to that in a moment) to augment the usefulness of the Kelley-Bryson method. And in an exhaustive blog post detailing the history of the algorithm, Jürgen Schmidhuber, an AI pioneer at the Swiss AI Lab IDSIA in Lugano-Viganello, mentions many others who had similar ideas. For example, in 1967, Shun’ichi Amari demonstrated techniques for using stochastic gradient descent to train multi-layer perceptrons with hidden units; and Seppo Linnainmaa, in his 1970 master’s thesis, developed the code for efficient backpropagation. In 1974, Paul Werbos submitted his Ph.D. thesis at Harvard. Titled Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, it came closest to articulating the modern version of the backpropagation algorithm. The thesis wasn’t publicized much; nor was it aimed at researchers in neural networks. Despite such developments, none of them made their mark as far as ML was concerned. That happened in the early 1980s, when Rumelhart, Hinton, and Williams developed their algorithm, and it’s this version that gave deep neural networks the boost they needed. To appreciate these developments, we need to put on our wading boots and step into a gentle stream of calculus, derivatives, and the chain rule.

增量是多少?

WHAT’S THE DELTA?

让我们重新审视一下均方算法(我们在第 3 章中遇到过的一个版本,以 Widrow-Hoff 算法的形式),该算法用于学习某项任务中单个神经元的权重。

Let’s revisit the mean square algorithm (a version of which we encountered in chapter 3, in the guise of the Widrow-Hoff algorithm) for learning the weights of a single neuron for some task.

下面显示的神经元具有权重w和偏差b,接受标量输入x,并生成标量输出y

The neuron shown below has a weight w and a bias b, takes a scalar input x, and generates a scalar output y.

输出y由以下公式给出:

The output y is given by:

y = wx + b

y = w.x + b

使用这个神经元,我们想要解决的问题如下。我们在 xy 平面上给出了一组点,对于 x 的每个值,都有一个对应的 y 值。假设我们给出了 x 和 y 之间关系的十个代表性点。以下是训练数据:

Using this neuron, here’s the problem we want to solve. We are given a set of points on the xy plane such that for every value of x, there’s a corresponding value of y. Let’s say we are given ten representative points for the relationship between x and y. Here are the training data:

只需目测即可清楚看出,xy之间的最佳关系是线性的。这样的直线具有斜率和偏移量(直线与原点的距离),这正是权重w和偏差b所代表的。一旦我们找到斜率和偏移量,给定一个新的x,我们就可以预测y。我们正在执行线性回归 - 即,在给定一些新输入的情况下,找到一条与训练数据最相符的直线,以便预测输出。

It’s clear from just a visual inspection that the best relationship between x and y is linear. Such a straight line has a slope and an offset (the distance of the line from the origin), which is exactly what the weight w and the bias b represent. Once we find the slope and offset, then given a new x, we can predict y. We are performing a linear regression—that is, finding a straight line to best fit the training data in order to predict the output, given some new input.

这是用于查找权重和偏差的所谓 delta 规则。(只要神经元都在一个层中,就可以将其推广到多组权重和偏差。)

Here’s the so-called delta rule for finding the weight and bias. (One can generalize this to multiple sets of weights and biases as long as the neurons are all in a single layer.)

初始化wb

Initialize w and b:

w = 0, b = 0

w = 0, b = 0

计算神经元的输出。

Calculate the output of the neuron.

yhat = wx + b

yhat = wx + b

计算误差:

Calculate the error:

e = y - yhat,其中y是预期值

e = y - yhat, where y is the expected value

计算平方损失:

Calculate the square loss:

损失 = (y - yhat)²

loss = (y - yhat)²

损失 = (y - wx - b)²

loss = (y - wx - b)²

如果我们将损失绘制为权重w和偏差b的函数,它将如下图所示。

If we plot the loss as a function of the weight w and bias b, it’d look like the figure shown below.

权重和偏差分别沿 x 轴和 y 轴变化。z 轴上的高度是给定权重和偏差以及某些训练数据集的损失。在本例中,我们有十对 (x, y) 点组成我们的训练数据。对于每一对,我们都可以计算损失。然后,我们将所有对相加并除以十,得到均方误差 (MSE)。我们在 z 轴上绘制的就是这个值。这里需要注意的是,对于给定的权重和偏差,损失对于不同的训练数据点集会有所不同。换句话说,虽然损失函数的形状取决于损失与权重和偏差的关系,但损失的确切值还取决于训练数据。从损失函数的形状(如图所示)可以清楚地看出,如果我们将权重和偏差随机初始化为某些值,我们很可能会落在斜坡的某个地方,而不是底部。

The weight and bias vary along the x-axis and y-axis, respectively. The height along the z-axis is the loss for a given weight and bias and for some set of training data. In this case, we have ten pairs of (x, y) points that comprise our training data. For each pair, we can calculate the loss. Then we sum over all the pairs and divide by ten to get the mean squared error (MSE). It’s this value that we plot on the z-axis. It’s important to note here that the loss you get for a given weight and bias would be different for a different set of training data points. In other words, while the shape of the loss function depends on the relationship of the loss to the weight and bias, the exact value of the loss also depends on the training data. It’s clear from the loss function’s shape, as plotted, that if we initialize the weight and bias randomly to some values, we’ll most likely land somewhere along the slopes, rather than at the bottom.

增量规则告诉我们,要到达底部,我们需要计算损失函数在任意给定点的梯度,并沿着梯度的负值迈出一小步,因为梯度是一个指向上坡的矢量。一小步意味着调整权重和偏差,调整量与梯度的各个分量成比例。我们一直这样做,直到损失变得可以接受。

The delta rule tells us that to reach the bottom, we need to calculate the gradient of the loss function at any given point and take a small step along the negative of the gradient, because the gradient is a vector that points upslope. A small step means tweaking the weight and the bias by a tiny amount that’s proportional to the respective components of the gradient. We keep doing this until the loss becomes acceptably small.

计算如下:

Here’s the calculus:

回想一下我们在第 3 章中对微积分的简要讨论,多元函数(在本例中为损失函数L,它取决于wb)的梯度是一个向量:向量的每个元素都是该函数关于一个变量的偏导数,所有其他变量都视为常数。

Recall from our brief discussion of calculus in chapter 3 that the gradient of a multi-variate function (in this case, the loss function L, which depends on w and b) is a vector: Each element of the vector is a partial derivative of the function with respect to one variable, with all other variables treated as constants.

我们从损失函数开始:

We start with our loss function:

L = (y - yhat)² = (y - wx - b)²

L = (y - yhat)² = (y - wx - b)²

为了进行这种区分,我们需要微积分中的两个简单规则。第一个规则称为幂律:

To do this differentiation, we need two simple rules from calculus. The first is called the power rule:

假设 y = xⁿ

Say, y = xⁿ

dy/dx = n·xⁿ⁻¹

因此,如果 y = x³

So, if y = x³

dy/dx = 3x²

接下来是链式法则,它将在我们即将构建的反向传播算法中发挥巨大作用。让我们慢慢来。

The next is the chain rule, which will play an outsize role in the backpropagation algorithm we are building up to. Let’s take our time with it.

如果:

If:

y = f ( z ) 和z = g ( x )

y = f(z) and z = g(x)

y = f ( g ( x ))

y = f(g(x))

然后根据链式法则:

then, according to the chain rule:

dy/dx = (dy/dz)·(dz/dx)

换句话说,如果函数 f(z) 依赖于变量 z,而 z 又依赖于另一个变量 x,那么 f(z) 对第二个变量 x 的导数,可以通过把 f(z) 对第一个变量 z 的导数与 z 对第二个变量 x 的导数串联起来计算。原则上,这个链可以任意长,正是这一特性赋予了反向传播算法强大的能力。

In words, what this is saying is that if a function f(z) depends on a variable, z, which itself depends on another variable, x, then the derivative of the function f(z) w.r.t. the second variable x can be calculated by chaining together the derivative of f(z) w.r.t. the first variable, z, and the derivative of z w.r.t. the second variable, x. The chain, in principle, can be arbitrarily long, and it’s this feature that gives the backpropagation algorithm its chops.

但首先,这里有一个简单的例子:

But first, here’s a simple example:

如果

If

z = x²,y = sin(z)

Let z = x², y = sin(z)

dy/dx = (dy/dz)·(dz/dx) = cos(z)·2x = 2x·cos(x²)
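This result is easy to verify numerically. The sketch below compares the chain-rule derivative of y = sin(x²) with a centered finite difference; the agreement is to many decimal places.

```python
import math

# Chain rule check for y = f(g(x)) with f(z) = sin(z), g(x) = x**2:
# dy/dx = cos(x**2) * 2x. A centered finite difference should agree.

def g(x):
    return x ** 2

def f(z):
    return math.sin(z)

def dydx_chain(x):
    return math.cos(g(x)) * 2 * x                  # f'(g(x)) * g'(x)

def dydx_numeric(x, h=1e-6):
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)   # centered difference

x = 1.3
print(abs(dydx_chain(x) - dydx_numeric(x)) < 1e-6)  # True
```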

现在我们来考虑损失函数,以及损失函数关于权重 w 和偏差 b 的导数。请注意符号从常规的 d 变为弯曲的 ∂:弯曲的 ∂ 表示我们对某一个特定变量求函数的偏导数;函数本身依赖于多个变量。

Let’s now consider our loss function and the derivative of the loss function with respect to the weight, w, and the bias, b. Note the change in notation from using the regular d to the curved ∂: The curvy ∂ notation denotes that we are taking a partial derivative of a function with respect to one specific variable; the function itself depends on multiple variables.

L = (y - wx - b)²

L = (y - wx - b)²

L = e²,其中 e = (y - wx - b)

L = e², where e = (y - wx - b)

∂L/∂w = (∂L/∂e)·(∂e/∂w) = 2e·(-x) = -2x(y - wx - b)

相似地:

Similarly:

∂L/∂b = 2e·(-1) = -2(y - wx - b),因为 y、w 和 x 相对于 b 都是常数

∂L/∂b = 2e·(-1) = -2(y - wx - b), since y, w, and x are constants w.r.t. b

因此,损失函数上某一点的梯度为:

So, the gradient at a point along the loss function is:

∇L = [∂L/∂w, ∂L/∂b] = [-2x(y - wx - b), -2(y - wx - b)]

对于wb的任何值,以及任何 (输入,输出) 或 (x, y) 对,我们都可以计算梯度。我们对所有数据对执行此操作,将梯度相加,然后除以数据点的总数,以在给定训练数据的情况下获得损失函数沿线任何位置的整体梯度。

For any value of w and b, and for any (input, output), or (x, y) pair, we can calculate the gradient. We do this for all pairs of data, sum up the gradients, and divide by the total number of data points, to get the overall gradient at any location along the loss function, given the training data.

这是更新规则(之所以称为增量规则,是因为它以较小的量 delta增加了wb ):

Here’s the update rule (it’s called the delta rule because it increments w and b by a small amount, delta):

w = w + Δw

w = w + Δw

相似地,

Similarly,

b = b + Δb

b = b + Δb

实际上,增量本身会乘以一个称为学习率 alpha 的小数,这样权重和偏差只会根据梯度的一小部分进行调整。

In practice, the deltas are themselves multiplied by a small number called the learning rate, alpha, so that the weights and biases are adjusted by only a small fraction of the gradient.

Δw = -α·∂L/∂w

Δb = -α·∂L/∂b

其中,α = 学习率,取某个较小的值,例如 0.01

where α = the learning rate, some small value, say 0.01

更新 w:

Update w:

w = w + Δw

w = w + Δw

相似地,

Similarly,

更新b

Update b:

b = b + Δb

b = b + Δb

更新权重和偏差,重新评估损失,并继续执行此操作,直到损失低于可接受的限度;然后停止。我们将找到wb的合理值,该值可以很好地拟合训练数据。对于我们的初始数据,这样的线可能如下所示:

Update the weight and bias, reevaluate the loss, and keep doing this until the loss falls below an acceptable limit; then stop. We’ll have found a reasonable value for w and b that nicely fits the training data. Here’s what such a line might look like, for our initial data:

我们刚才做的是梯度下降,下降到损失函数的最小值或最小值附近的位置。这是针对单个神经元的,具有一个权重和一个偏差项。这种简单的配置使我们能够将损失绘制为两个参数wb 的函数。但我们可以轻松地将其扩展到具有2 个或 10 个甚至 100 个输入。我们的解决方案的另一个重要方面是它是线性的。我们通过数据点拟合一条直线;我们进行了线性回归。

What we just did was gradient descent, going down to the minimum of the loss function or to a location near the minimum. It was for a single neuron, with one weight and a bias term. This simple configuration allowed us to plot the loss as a function of the two parameters, w and b. But we can easily extend this to a neuron with 2 or 10 or even 100 inputs. The other important aspect of our solution is that it’s linear. We fit a straight line through the data points; we did linear regression.
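The whole procedure, from the loss through the delta-rule updates, fits in a few lines. The sketch below uses made-up training data (ten points near the line y = 2x + 1); the learning rate and iteration count are illustrative assumptions, not values from the book.

```python
import numpy as np

# Delta rule for a single neuron y = w*x + b, fit by gradient
# descent on the mean squared error over ten training points.

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + 1 + rng.normal(0, 0.01, size=10)   # ten noisy training points

w, b = 0.0, 0.0
alpha = 0.1                                    # learning rate

for _ in range(2000):
    yhat = w * x + b
    e = y - yhat
    # Gradient of the mean squared loss over all points:
    #   dL/dw = -2 * mean(e * x),  dL/db = -2 * mean(e)
    w += alpha * 2 * np.mean(e * x)            # w = w + delta_w
    b += alpha * 2 * np.mean(e)                # b = b + delta_b

print(round(w, 1), round(b, 1))                # close to 2.0 and 1.0
```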

我们可以很容易地找到一条直线将 xy 平面分成两个区域,一个区域包含一类数据(圆圈),另一个区域包含第二类数据(三角形)。例如,这里有这样一个数据集:

We could just as easily have found a straight line to separate the xy plane into two regions, one with one class of data (circles) and the other with the second class of data (triangles). For example, here’s such a dataset:

我们在之前的多个章节中都见过这个问题。问题是找到线性分离超平面——在本例中是这个二维空间中的一条直线。这里,每个数据点由x1 , x2给出(其中x1是沿 x 轴的值,x2是沿 y 轴的值)。

We have seen this problem before, in multiple previous chapters. The problem is to find the linearly separating hyperplane—in this case, a straight line in this 2D space. Here, each data point is given by x1, x2 (where x1 is the value along the x-axis and x2 is the value along the y-axis).

我们需要一个接受两个输入(x1x2)并计算输出的神经元:

We’ll require a neuron that takes in two inputs (x1, x2) and computes an output:

yhat = w1x1 + w2x2 + b

yhat = w1x1 + w2x2 + b

损失:

The loss:

L = (y - yhat)² = (y - (w1x1 + w2x2 + b))²

L = (y - yhat)² = (y - (w1x1 + w2x2 + b))²

L = y² - 2y(w1x1 + w2x2 + b) + (w1x1 + w2x2 + b)²

L = y² - 2y(w1x1 + w2x2 + b) + (w1x1 + w2x2 + b)²

我们已经遇到过链式法则。现在是时候利用它了:

We have already encountered the chain rule. Now it’s time to make use of it:

L = (y - yhat)² = e²

L = (y - yhat)² = e²

所以:

So:

∂L/∂w1 = 2e·(-x1) = -2x1(y - yhat)

相似地:

Similarly:

∂L/∂w2 = -2x2(y - yhat)

以及

and

∂L/∂b = -2(y - yhat)

因此,损失函数中某一点的梯度是:

So, the gradient at some point along the loss function is:

∇L = [∂L/∂w1, ∂L/∂w2, ∂L/∂b] = [-2x1(y - yhat), -2x2(y - yhat), -2(y - yhat)]

再次,我们将按以下方式更新权重和偏差:

Again, here’s how we’d update the weights and the bias:

w1 = w1 - α·∂L/∂w1

w2 = w2 - α·∂L/∂w2

b = b - α·∂L/∂b

我们的算法对所有训练数据进行迭代,直到损失接近碗状损失函数的底部(遗憾的是,我们无法将其可视化,因为损失现在取决于三个变量 w1、w2 和 b;绘图将是四维的)。一旦损失达到最优,即它可以接受地接近于零,我们最终会得到一组权重和偏差,从而给出一条分界线(见下图)。

Our algorithm iterates over all the training data until the loss ends up near the bottom of the bowl-shaped loss function (which we cannot, unfortunately, visualize, as the loss now depends on three variables, w1, w2, and b; the plot would be in four dimensions). Once the loss is optimal, meaning it’s acceptably close to zero, we end up with weights and a bias that give us a dividing line (see figure, below).
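The same loop extends directly to the two-weight neuron. In the sketch below, the two clusters of points (standing in for the circles and triangles) are made-up data; after gradient descent, thresholding the neuron's output classifies the points.

```python
import numpy as np

# A linear neuron with two weights and a bias, trained by gradient
# descent on the mean squared error, then used as a classifier.

rng = np.random.default_rng(1)
class0 = rng.normal([0, 0], 0.3, size=(20, 2))   # one class of points
class1 = rng.normal([2, 2], 0.3, size=(20, 2))   # the other class
X = np.vstack([class0, class1])
y = np.array([0.0] * 20 + [1.0] * 20)            # target labels

w = np.zeros(2)
b = 0.0
alpha = 0.05

for _ in range(500):
    yhat = X @ w + b
    e = y - yhat
    w += alpha * 2 * (X.T @ e) / len(y)          # dL/dw = -2 * mean(e * x)
    b += alpha * 2 * np.mean(e)                  # dL/db = -2 * mean(e)

# Classify by thresholding the output at 0.5:
pred = (X @ w + b > 0.5).astype(float)
print(np.mean(pred == y))                        # fraction classified correctly
```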

我们从一个具有一个权重和一个偏置项的神经元变成了一个具有两个权重和一个偏置项的神经元。如果我们要将猫的图像与狗的图像区分开来,每幅图像有 100 个像素,而代表猫图像的点聚集在一个100 维空间中的某个区域,而另一个区域中的点代表狗?我们可以简单地让一个神经元接收 100 个输入,每个输入对应一个像素值!只要数据是线性可分的(对于分类问题),或者我们想通过数据点绘制一条线(或超平面)进行回归,我们迄今为止看到的方法就足够了。

We went from a neuron with one weight and a bias term to a neuron with two weights and a bias term. What if we had to separate out images of cats from images of dogs, where each image had 100 pixels, with the points representing the cat images clustered in one region of 100-dimensional space and the points representing dogs in another region? We could simply have a neuron that took in 100 inputs, one for each pixel value! As long as the data are linearly separable (for a classification problem), or we want to draw a line (or hyperplane) through the data points for regression, the method we have seen so far is sufficient.

但是——这也是明斯基和帕普特大惊小怪的原因——如果数据不是线性可分的怎么办?我们迄今为止使用的方法还能行得通吗?我们知道答案:不行。现在是时候真正理解为什么会这样,并超越明斯基和帕普特的牵强反对意见了。

But—and this is what Minsky and Papert created an unholy fuss about—what if the data are not linearly separable? Does the method we have used thus far still work? We know the answer: No, it doesn’t. It’s time to really understand why and to go beyond Minsky and Papert’s tenuous objections.

非线性

A TOUCH OF NONLINEARITY

考虑下面第一个面板中显示的数据集;它是 Minsky 和 Papert 的 XOR 问题的一个版本。

Consider the dataset shown in the first panel below; it’s a version of Minsky and Papert’s XOR problem.

没有一条线可以清晰地将圆形和三角形区分开来。我们需要一个神经网络,它可以将 xy 平面分成两个区域,即明区和暗区,这样落在暗区的数据点就可以被分类为三角形;否则,就是圆形。这不是一个简单的问题。

No single line can cleanly separate the circles from the triangles. What we need is a neural network that can separate the xy plane into two regions, light and dark, such that a data point that falls in the dark region can be classified as a triangle; otherwise, as a circle. This is not a trivial problem.

我们知道,一个具有权重和偏差项的神经元可以找到一条直线。显然,我们首先需要两个这样的神经元来找到这两条直线。上面显示的是这两条线可能的样子。(我说“可能”,是因为每次训练神经元时,你可能会得到斜率和偏移量略有不同的直线。)

We know that a single line can be found by one neuron with weights and a bias term. Clearly, we first need two such neurons to find the two lines. Shown above are what those two lines might look like. (I say “might” because each time you train the neurons, you might get lines with slightly different slopes and offsets.)

我们越来越明白为什么需要多层来解决 XOR 问题——在本例中,至少需要两层。向同一层添加更多神经元只会找到更多线条。这不是我们想要的。我们希望神经元能够利用这些线条并将它们组合成更复杂的东西——在我们的例子中,将 2D 空间分成两个区域,一个位于两条线之间,另一个是 xy 平面的其余部分。

We are getting warmer as to why we need multiple layers to solve the XOR problem—in this case, at least two layers. Adding more neurons to the same layer will simply find more lines. That’s not what we want. We want neurons that can take these lines and combine them into something more complex—in our case, a 2D space separated into two regions, one that lies between the two lines and another that’s the rest of the xy plane.

因此,第一层包含两个神经元,每个神经元负责找到一条线。第二层将包含至少一个神经元,用于学习创建第一层神经元输出的加权和,以便将 xy 平面划分为两个区域,一个区域为圆形,另一个区域为三角形(如上所示)。让我们构建一个这样的神经元网络,从一个熟悉的神经元开始:

So, the first layer involves two neurons, each of which finds one line. The second layer will consist of at least one neuron that learns to create a weighted sum of the outputs of the neurons of the first layer, such that it delineates the xy plane into two regions, one for circles and the other for triangles (shown above). Let’s build a network of such neurons, starting with a familiar-looking neuron:

神经元接受两个输入x1x2,并产生输出y,这取决于两个处理阶段。

The neuron takes in two inputs, x1 and x2, and produces an output y, which depends on two processing stages.

z = w1x1 + w2x2 + b

z = w1x1 + w2x2 + b

y = a(z)

y = a(z)

如果a ( z ) = z,我们有一个简单的线性神经元

If a(z) = z, we have a simple linear neuron

a(z) = z

a(z) = z

y = w1x1 + w2x2 + b

y = w1x1 + w2x2 + b

正如我们在第 9 章中学到的,函数a ( z ) 称为激活函数。在第 1 章和第 2 章中我们遇到的神经元中,a ( z ) 是一个阈值函数。此类函数的一个例子是:

As we learned in chapter 9, function a(z) is called an activation function. In the neurons we encountered in chapters 1 and 2, a(z) was a threshold function. An example of such a function would be:

如果z > 0:a ( z ) = 1

If z > 0: a(z) = 1

否则:a(z) = 0

Else: a(z) = 0

阈值函数的一个问题是它不是处处可微的,也就是说它不是处处都有导数或斜率。处处不可微并不一定会导致函数失败——有办法在有问题的位置近似导数——但在我们的阈值函数中,斜率总是零,除了在过渡点,那里是无穷大。这没什么用。但稍加调整就可以使阈值函数连续,这样就可以微分了:

One of the problems with the thresholding function is that it’s not differentiable everywhere, meaning it doesn’t have a derivative, or slope, everywhere. Not being differentiable everywhere doesn’t necessarily doom a function—there are ways to approximate the derivative at problematic locations—but in the case of our thresholding function, the slope is zero always, except at the point of transition, where it’s infinite. This is unhelpful. But a small tweak can make the threshold function continuous, so that it can be differentiated:

σ(z) = 1 / (1 + e^(-z))

这就是 S 型(sigmoid)函数。我们在上一章讨论通用近似定理时遇到过它。该函数的图像如下所示。

This is the sigmoid function. We encountered it in the previous chapter, on the universal approximation theorem. The plot of the function is shown below.

请注意,当x小于 -1时,该函数是平滑的,其值接近于 0(但不为 0) ,然后它开始上升到 1。就像阈值函数一样,它产生的输出将某个坐标空间划分为两个区域,S 形函数的作用大致相同。基本上,当z趋向于无穷大时,函数趋向于 1,而当z趋向于负无穷大时,函数趋向于零。曲线的一小部分(在图中,介于z约为 -0.5 和 0.5),其中 S 型函数几乎是一条直线。

Note that the function is smooth and has a value close to 0 (but not 0) when x is less than -1, and then it starts rising to a value of 1. Just like the thresholding function, which produces an output that divides some coordinate space into two regions, the sigmoid does the same, approximately. Basically, as z tends to infinity, the function tends to 1, and as z tends to minus infinity, the function tends to zero. There’s a small portion of the curve (in the figure, between z of about -0.5 and 0.5), where the sigmoid function is nearly a straight line.

最重要的是,该函数有一个导数(推导过程参见本页的尾注),并且该导数可以用函数本身来表示:

Most important, the function has a derivative (see the coda on this page for the derivation), and this derivative is expressed in terms of the function itself:

σ′(z) = σ(z)(1 - σ(z))
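The identity that the sigmoid's derivative can be written in terms of the sigmoid itself, σ′(z) = σ(z)(1 − σ(z)), is easy to confirm numerically:

```python
import math

# Verify sigma'(z) = sigma(z) * (1 - sigma(z)) against a
# centered finite difference.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = 0.7
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
print(abs(sigmoid_deriv(z) - numeric) < 1e-9)  # True
```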

现在,我们可以将所有元素放在一起,设计一个具有一个隐藏层的简单神经网络来解决 XOR 问题。它是一个三层网络(见下图):第一层只是输入层(x1x2),第二层是隐藏层,具有两个隐藏神经元,输出层有一个神经元。(同样,每个神经元的偏差没有明确显示;将其视为给定值。)

We can now put all the elements together to design a simple neural network with one hidden layer, to solve the XOR problem. It’s a three-layer network (see figure, below): The first is simply the input layer (x1, x2), the second is the hidden layer, with two hidden neurons, and the output layer has one neuron. (Again, the bias is not shown explicitly for each neuron; take it as a given.)

第一个隐藏神经元的输出是:

The output of the first hidden neuron is:

z₂¹ = w₂¹¹x1 + w₂¹²x2 + b₂¹

a₂¹ = σ(z₂¹)

我们现在沿用上一章引入的记号。除了我们已经知道的权重的上标和下标之外,对于值 z 和 a 以及偏差 b,下标指层,上标指神经元在该层中的位置。因此,w₂¹² 指第 2 层的一个权重,其输入来自上一层的神经元 2,并进入当前层的神经元 1;而 a₂¹ 指第 2 层神经元 1 的输出。

We are now following the notation introduced in the previous chapter. Besides the superscripts and subscripts for the weights, which we already know about, for the values z and a, and the bias b, the subscript refers to the layer, the superscript to the location of the neuron in the layer. So, w₂¹² refers to a weight for layer 2, for an input coming from neuron 2 of the previous layer and going to neuron 1 in the current layer. And a₂¹ refers to the output of neuron 1 in layer 2.

考虑到这一点,我们看到隐藏层的第二个神经元的输出是:

With this in mind, we see that the output of the second neuron in the hidden layer is:

z₂² = w₂²¹x1 + w₂²²x2 + b₂²

a₂² = σ(z₂²)

更一般地,对于任何神经元:

More generically, for any neuron:

z = wx + b

z = w.x + b

a = σ(z)

a = σ(z)

最后,输出神经元对两个隐藏神经元的输出进行加权求和,并将其传递给 S 型激活函数:

Finally, the output neuron takes a weighted sum of the output of the two hidden neurons and passes that through a sigmoid activation function:

z₃¹ = w₃¹¹a₂¹ + w₃¹²a₂² + b₃¹

yhat = a₃¹ = σ(z₃¹)

如果您想要训练这样的网络,您必须对某个损失L计算以下偏导数。

If you want to train such a network, you must calculate the following partial derivatives, for some loss L.

对于输出神经元的权重和偏差:

For the weights and bias of the output neuron:

∂L/∂w₃¹¹, ∂L/∂w₃¹², ∂L/∂b₃¹

对于隐藏神经元的权重和偏差:

And for the weights and biases of the hidden neurons:

∂L/∂w₂¹¹, ∂L/∂w₂¹², ∂L/∂w₂²¹, ∂L/∂w₂²², ∂L/∂b₂¹, ∂L/∂b₂²

一旦我们计算出这些偏导数,或者损失函数相对于所有权重和偏差的梯度,我们就可以逐步更新每个权重和偏差并进行梯度下降。给定训练数据,这样的网络将学习使其能够解决 XOR 问题的权重和偏差:它将以前面显示的方式描绘 xy 坐标空间。

Once we calculate these partial derivatives, or the gradient of the loss function w.r.t. all the weights and biases, we can then update each weight and bias incrementally and do gradient descent. Such a network, given training data, will learn the weights and biases that enable it to solve the XOR problem: It’ll delineate the xy coordinate space in the manner shown earlier.
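Putting the pieces together, here is a sketch of the 2-2-1 sigmoid network described above, trained on XOR by gradient descent with backpropagated errors. The initialization scale, learning rate, iteration count, and restart loop are illustrative choices, not from the book; the random restarts guard against the local minima discussed earlier.

```python
import numpy as np

# Train a 2-input, 2-hidden-neuron, 1-output sigmoid network on XOR.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])                # XOR targets

def train_xor(seed, iters=20000, alpha=2.0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 1, (2, 2))                 # hidden weights
    b1 = np.zeros(2)
    w2 = rng.normal(0, 1, 2)                      # output weights
    b2 = 0.0
    for _ in range(iters):
        # Forward pass
        h = sigmoid(X @ W1.T + b1)                # hidden activations
        yhat = sigmoid(h @ w2 + b2)               # network outputs
        # Backward pass: chain rule applied layer by layer
        d_out = (yhat - y) * yhat * (1 - yhat)    # dL/dz at the output
        d_hid = np.outer(d_out, w2) * h * (1 - h) # dL/dz at the hidden layer
        w2 -= alpha * (h.T @ d_out) / 4
        b2 -= alpha * d_out.mean()
        W1 -= alpha * (d_hid.T @ X) / 4
        b1 -= alpha * d_hid.mean(axis=0)
    yhat = sigmoid(sigmoid(X @ W1.T + b1) @ w2 + b2)
    return yhat, float(np.mean((yhat - y) ** 2))

# Gradient descent can stall in a local minimum, so retry from
# fresh random initializations until the loss is small.
for seed in range(10):
    yhat, loss = train_xor(seed)
    if loss < 0.01:
        break

print((yhat > 0.5).astype(int).tolist())          # [0, 1, 1, 0] once converged
```

Random (asymmetric) initialization is what breaks the symmetry discussed earlier in the chapter: with identical starting weights, the two hidden neurons could never learn the two different lines the problem requires.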

While the way in which we solved the problem may seem tractable for the XOR issue, just let your imagination wander a little, to a network with tens or hundreds of hidden layers, with each hidden layer composed of 100 or 1,000 or even 10,000 neurons. And what if the loss function—we have been using a simple one so far—itself becomes hideously complicated (as can happen when the problem being solved becomes more difficult)? If we have to explicitly and analytically calculate the partial derivative of the loss function w.r.t. each weight and bias in the network, the process can quickly become insanely unrealistic.

How, then, do you train the network, or how do you find each of the partial derivatives, in some sustainable manner, where the algorithm doesn’t have to be tweaked just because you changed the number of neurons per layer or the number of layers? This was the question consuming researchers in the late 1970s and early ’80s. First Werbos and then Rumelhart, Hinton, and Williams, independently, developed an elegant technique for calculating the partial derivatives using the chain rule.

THE BACKPROPAGATION ALGORITHM

To understand “backpropagation” (the term introduced by Rosenblatt), we’ll turn to the simplest possible one-hidden-layer network, with one hidden neuron.

During training, for some input x, the network produces an output yhat. Here is the sequence of computations that lead to yhat. (To avoid unwieldy subscripts and superscripts, we’ll just use the weight w1 and bias b1 for the hidden neuron in the first layer and w2 and b2 for the output neuron in the second layer.)

z1 = w1x + b1

a1 = σ(z1)

z2 = w2a1 + b2

yhat = σ(z2)

If the expected output is y, then the error e is:

e = (y - yhat)

We’ll define the square of the error as the loss—but remember, we are choosing to define the loss as such; for any given problem, one has to select an appropriate loss function. For our purposes, we can use this popular loss function without any loss of generality:

L = e²

To update the two sets of weights and biases, we’ll need the following partial derivatives:

Here’s an elegant use of the chain rule to get at these partial derivatives. (There’s a trick to making sure you have the chain correct: Just cancel the denominator of the first partial derivative with the corresponding numerator to its right, and continue doing so, moving rightward. In the end, you should be left with just the partial derivative on the left-hand side of the equation. But keep in mind that this is not an actual mathematical operation, just a device to ensure that the chain is correct.)
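Applied to the output weight w2 of our tiny network, the chain reads (a reconstruction from the quantities defined above):

```latex
\frac{\partial L}{\partial w_2}
  = \frac{\partial L}{\partial e}\cdot
    \frac{\partial e}{\partial \hat{y}}\cdot
    \frac{\partial \hat{y}}{\partial z_2}\cdot
    \frac{\partial z_2}{\partial w_2}
  = 2e \cdot (-1) \cdot \sigma(z_2)\bigl(1 - \sigma(z_2)\bigr) \cdot a_1
```

Canceling each denominator against the numerator to its right leaves only ∂L/∂w2 on the left, confirming the chain is correct.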

Each of these individual partial derivatives on the right-hand side of the equation is easily calculated:

So:

This looks like a lot of work to calculate a partial derivative of the loss function with respect to just one weight (w2, in this case) of even the simplest possible network, but here’s the cool thing: Every element of that right-hand side was already computed during the forward pass through the network, when the network was converting the input x into the output yhat. All we must do is simply keep track of those numbers and perform some simple arithmetic.

Similarly:

We now have the gradient of the loss function w.r.t. the weight and bias of the output neuron, which is enough to update these two parameters.

But what about the weight and bias of the hidden neuron? Well, we continue “backpropagating” the error using the chain rule.

Similarly:

Again, the network has computed everything that’s needed for these calculations during the forward pass through the network. However, notice that the computation now needs to know the old value of the weight of the second layer, w2. This means that after the forward pass, not only do we need to keep in memory the results of all the computations, but we also need to remember the old weights. (An aside: There’s a very important and interesting question about whether biological brains do backpropagation. The algorithm is considered biologically implausible, precisely because it needs to store the entire weight matrix used during the forward pass; no one knows how an immensely large biological neural network would keep such weight matrices in memory. It’s very likely that our brains are implementing a different learning algorithm.)
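Both observations — that every needed quantity is produced by the forward pass, and that the backward pass also needs the old weight w2 — can be checked numerically. A sketch with made-up values for w1, b1, w2, b2, x, and y (all illustrative assumptions), comparing the chain-rule gradients against finite differences:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The one-hidden-neuron network from the text, with illustrative values.
w1, b1, w2, b2 = 0.4, -0.1, 0.7, 0.2
x, y = 1.5, 1.0

def loss(w1, b1, w2, b2):
    a1 = sigmoid(w1 * x + b1)
    yhat = sigmoid(w2 * a1 + b2)
    return (y - yhat) ** 2

# Forward pass, keeping every intermediate value.
z1 = w1 * x + b1
a1 = sigmoid(z1)
z2 = w2 * a1 + b2
yhat = sigmoid(z2)
e = y - yhat

# Backward pass: the chain rule, reusing only stored values (and the old w2).
dL_dw2 = 2 * e * (-1) * yhat * (1 - yhat) * a1
dL_dw1 = 2 * e * (-1) * yhat * (1 - yhat) * w2 * a1 * (1 - a1) * x

# Compare against centered finite differences.
eps = 1e-6
num_dw2 = (loss(w1, b1, w2 + eps, b2) - loss(w1, b1, w2 - eps, b2)) / (2 * eps)
num_dw1 = (loss(w1 + eps, b1, w2, b2) - loss(w1 - eps, b1, w2, b2)) / (2 * eps)
print(abs(dL_dw2 - num_dw2), abs(dL_dw1 - num_dw1))
```

The two estimates agree to within floating-point error, which is exactly what makes backpropagation trustworthy.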

Now we can update the weight and bias of the first layer:

You can see how, if our network had more than one hidden layer and more than one neuron per hidden layer, you can basically calculate the gradient with respect to each weight and bias and update them, too. This is the backpropagation algorithm. (See the coda on this page for how to generalize this result.)

This is the awesome power of the backpropagation algorithm. If the sequence of computations that lead from the input to the loss is differentiable at every step, we can compute the gradient of the loss function. Given the gradient, we can update each weight and bias a tiny bit, thus performing gradient descent until the loss is acceptably minimized.

It’s hard to overstate the flexibility and power of the algorithm. You can, in principle, construct a network with any number of layers, any number of neurons per layer; build a network with either sparse or dense connections; design the appropriate loss function. All these choices together dictate the task you are asking your network to perform. Training eventually comes down to this: Provide the network with some set of inputs, figure out what the expected output should be (either because we humans have annotated the data and know what the output should be or because, in types of learning called self-supervised, the expected output is some known variation of the input itself), calculate the loss, calculate the gradient of the loss, update the weights/biases, rinse and repeat. Below is an example of a multi-layer perceptron, or a fully connected deep neural network.

The first layer is the input layer. For this example, which involves recognizing images of handwritten digits and classifying them accordingly, the input layer is made of 784 neurons, one for each pixel of the 28×28 image. The 2D image is flattened into a 784-dimensional vector. Next is the first hidden layer. This can have 10 or 100 or 1,000 or more neurons. (The more complex the task, the more neurons you’ll need.) What’s important to note here is that in a fully connected neural network, or multi-layer perceptron, each neuron of each layer receives all the inputs from the previous layer. So, in the case of the first hidden layer, the first neuron of the layer receives all 784 inputs, as does every other neuron in that layer. Let’s say there are 1,000 neurons in the first layer. That means there are 1,000 outputs coming out of that layer. So, each neuron in the next layer receives each one of those 1,000 outputs as inputs.
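A quick sanity check of the connectivity arithmetic in that paragraph: a fully connected layer from n_in inputs to n_out neurons has n_in × n_out weights plus n_out biases.

```python
# Parameter count for a fully connected layer: every neuron receives
# every output of the previous layer, plus one bias per neuron.
def layer_params(n_in, n_out):
    return n_in * n_out + n_out

# 784 inputs (the flattened 28x28 image) into a 1,000-neuron hidden layer.
hidden1 = layer_params(784, 1000)
print(hidden1)  # 785,000 weights and biases for this single layer
```

A 1,000-neuron layer feeding the 10-neuron output layer adds another 10,010 parameters, which is why deep, wide networks accumulate millions of weights so quickly.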

It’s impossible to show this dense connectivity, so only some connections are shown in the above figure. But you get the idea.

The figure shows four hidden layers. Again, the more complex the task, the more hidden layers you might need. Even the number of neurons in the hidden layers can vary from layer to layer.

Of particular interest is the final, output layer. In this case, we have 10 output neurons. The idea is that a trained network, when presented with the image of one of 10 digits, will respond by firing neuron 0 for digit 0, neuron 1 for digit 1, and so on. (In each case, the other neurons may fire, too, but in a well-trained network, for an input image of digit 0, for example, the outputs of neurons 1–9 will be significantly less than the output of neuron 0, thus signifying that it has detected a “0.”)

To appreciate the power of backpropagation, think of the silly network we analyzed: one hidden layer with one hidden neuron. Well, the exact same process can be used to train this undeniably more complex network to recognize handwritten digits.

WHAT EXACTLY DOES THE NETWORK LEARN?

In his Ph.D. thesis, Paul Werbos had shown how this algorithm would work by creating a table of the intermediate operations leading to the final result. He wrote about this procedure of backpropagation, “In general, the procedure…allows us to calculate the derivatives backwards down any ordered table of operations, so long as the operations correspond to differentiable functions.” The final caveat is key: Every link in that chain has to be differentiable, or at least one should be able to satisfactorily approximate the function’s derivative everywhere. Werbos, however, wasn’t thinking about neural networks at the time.

Those who were thinking about neural networks in the 1970s had been working with binary threshold neurons. The threshold activation function of such neurons is not differentiable at the point where the function abruptly transitions from, say, 0 to 1.

Using the sigmoid as an activation function was one of the nuances of the work done by Rumelhart, Hinton, and Williams. The other advance had to do with concerns about symmetry breaking that we encountered earlier. When Hinton told Rumelhart about his interpretation of Rosenblatt’s work—that breaking symmetry necessitated stochastic neurons—Rumelhart instantly saw a different way out. “His immediate reaction was ‘Well, why can’t we just break symmetry by having random initial weights?’ Rosenblatt didn’t think of that,” Hinton told me. Neither had Hinton. Basically, by setting the initial value of each weight and bias in the network to some small random value (sampled from, say, a simple Gaussian distribution), one could ensure that symmetry was broken.
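A sketch of that symmetry-breaking recipe (the 0.01 scale and the seed are arbitrary choices for illustration):

```python
import random

random.seed(42)

def init_layer(n_in, n_out, scale=0.01):
    """Small random Gaussian weights and biases break the symmetry
    between neurons in the same layer."""
    weights = [[random.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [random.gauss(0.0, scale) for _ in range(n_out)]
    return weights, biases

w, b = init_layer(2, 2)
# With all-zero initialization, both hidden neurons would compute the same
# output, receive the same gradient, and never differentiate. Random
# initialization guarantees they start out different.
print(w[0] != w[1])
```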

Hinton credits Rumelhart with designing the algorithm—or, rather, reinventing it, as others had already thought of it for other purposes. Hinton helped refine, implement, and test it; Williams helped with the math; and Rumelhart and Hinton focused their attention on what the algorithm empowered a multi-layer neural network to learn. They weren’t interested only in the fact that a neural network with hidden layers could approximate any function—which it could, given enough neurons. “We were the group that used backpropagation to develop interesting representations,” said Hinton, who is now at the University of Toronto.

And therein lies the import of neural networks. The algorithms we saw in earlier chapters, including support vector machines, all required us to specify beforehand the features in the data. Let’s say we are dealing with a two-dimensional dataset. The obvious features would be the values x1 and x2. But this won’t always work. For example, to separate the circles from the triangles in the dataset shown below, a linear classifier that works in two dimensions won’t do.

We know from previous attempts at solving this problem that it’s not enough to just use [x1, x2] as the features. Rather, we need some nonlinear features. Specifically, we need to know these features in advance. For instance, to solve this particular problem, we could use a triplet of features [x1, x2, x1x2]. Even if we used a kernel to project this data into high dimensions, and then did a linear classification in the higher-dimensional space, we’d still need to design the kernel. But, with a neural network with sufficient neurons, all we would need to do is provide the inputs x1 and x2 and let the network figure out the features needed to classify the data correctly. It’ll learn to represent these features internally. A neural network with just three neurons in the hidden layer might find the following decision boundary for our example dataset (more hidden-layer neurons would enable a smoother decision boundary):

Rumelhart, Hinton, and Williams emphasized this aspect in their paper on backpropagation, the title of which read, “Learning Representations by Back-propagating Errors.” The abstract of their paper states, “As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent [italics mine] important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.”

Of course, publishing the paper—it’s barely three pages long—involved laying some groundwork. The trio sent it to the journal Nature. “I did some political work in Britain of going and talking to all the people who might be referees,” Hinton told me. One of them was Stuart Sutherland, an experimental psychologist at the University of Sussex. Hinton described to Sutherland how backpropagation allowed a neural network to learn representations. “It took a while to explain to him, but then he really got it,” Hinton said. The canvassing paid off. Whomever Nature sent the paper to for peer review thought well enough of it; the journal accepted it for publication. Sutherland even wrote an accompanying essay in the same issue of the journal.

The importance of learning features, rather than having to hand-design them, will become increasingly clear in the next chapter, when we tackle image recognition, the application that brought neural networks fame. Even as Rumelhart, Hinton, and Williams were working on their backpropagation paper, a young student in Paris had independently developed an algorithm that achieved similar results. A colleague told Hinton that “there is a kid in France who is working on the same stuff.” The kid was Yann LeCun. When Hinton met LeCun during a conference in Europe—this was before the backpropagation paper had been published—their intellectual chemistry was instant. “We were completing each other’s sentences,” LeCun told me. “He explained to me what he was working on. He was working on backprop.” LeCun immediately understood the import of the work. He told Hinton, “You don’t have to explain that to me.” Hinton and LeCun would collaborate briefly soon after and then establish major labs of their own, setting the stage for the revolution in deep learning, the subject of our next chapter.

Meanwhile, in 1987, soon after the publication of the Nature paper, Rumelhart moved to Stanford University. He retired in 1998, as he fell ill with Pick’s disease, a progressive neurodegenerative disorder; he died in 2011. “If he had lived, he would have been the person who got most of the credit for backprop,” Hinton said. As it happens, it’s Hinton who most often gets associated with and fêted for the algorithm, though even he admits he was simply one of a long chain of people who had grappled with the problem.

MATHEMATICAL CODA

The derivative of the sigmoid function

The sigmoid function is:

σ(z) = 1/(1 + e^(-z))

Let u = 1 + e^(-z)

So:

σ = 1/u

Using the chain rule:

dσ/dz = (dσ/du) × (du/dz)

The first part of the expression is:

dσ/du = -1/u²

The second part of the expression is:

du/dz = -e^(-z)

So:

dσ/dz = (-1/u²) × (-e^(-z)) = e^(-z)/(1 + e^(-z))² = σ(z)(1 - σ(z))
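The identity σ′(z) = σ(z)(1 − σ(z)) is easy to verify numerically by comparing it against a centered finite-difference estimate at a few sample points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Compare the closed form sigma(z) * (1 - sigma(z)) against a
# centered finite-difference estimate of the derivative.
eps = 1e-6
max_gap = 0.0
for z in [-4.0, -1.0, 0.0, 0.5, 3.0]:
    closed_form = sigmoid(z) * (1.0 - sigmoid(z))
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
    max_gap = max(max_gap, abs(closed_form - numeric))
print(max_gap)  # tiny: the two agree to within floating-point error
```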

QED

GENERALIZATION OF THE BACKPROPAGATION ALGORITHM

Let’s start with an input vector, x. Say x = [x1, x2]. Take the first hidden layer of a neural network. Let’s say it has three neurons. Each neuron in the layer will create a weighted sum plus bias.

The first neuron’s weighted sum will be:

z1^1 = w1^11 x1 + w1^12 x2 + b1^1

The second neuron’s weighted sum will be:

z1^2 = w1^21 x1 + w1^22 x2 + b1^2

The third neuron’s weighted sum will be:

z1^3 = w1^31 x1 + w1^32 x2 + b1^3

We can write this as:

z1 = W1x + b1, where W1 is the 3×2 matrix of the layer’s weights and b1 is the vector of its three biases.

This intermediate output of the layer has to pass through the activation function. We can continue using the sigmoid function, though other functions can be used. The particulars don’t matter as long as the activation function is differentiable.

The output of layer 1, after activation, is: a1 = σ(z1)

What this is saying is that each neuron’s output—the weighted sum plus bias—is passed through a sigmoid. Let’s extend this to a network of three hidden layers and a final, output layer. Here’s the sequence of operations performed by the network:

Layer 1:

z1 = W1x + b1

a1 = σ(z1)

Layer 2:

z2 = W2a1 + b2

a2 = σ(z2)

Layer 3:

z3 = W3a2 + b3

a3 = σ(z3)

We now come to the final, output layer. There can be any number of neurons in the output layer, but for our purposes, consider a layer with just one neuron.

Output:

z4 = w4 · a3 + b4

yhat = σ(z4)

Calculating the error and loss gives us:

e = (y - yhat)

L = e²

We now have all the ingredients for calculating the gradient of the loss function with respect to all the weights and biases. For example, let’s say we want the partial derivative of the loss function L w.r.t. the weights of the third layer. That’s given by (note that these equations use the matrix form of the weights, and vector forms for the outputs of all the neurons in a layer, for compactness):

∂L/∂W3 = (∂L/∂e) (∂e/∂yhat) (∂yhat/∂z4) (∂z4/∂a3) (∂a3/∂z3) (∂z3/∂W3)

We know how to calculate each right-hand side term:

Each partial derivative equates to something that we either computed during the forward pass through the network (such as the value z4 or a2) or to the current value of the weights (such as w4). We now have the gradient of the loss function with respect to the weights of one layer, and we can use the delta rule to update the weights.
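The matrix form lends itself to a compact forward pass. Here is a minimal pure-Python sketch of the four-layer network just described (the layer widths, input values, and initialization scale are illustrative assumptions), storing every activation for later use by the backward pass:

```python
import math
import random

random.seed(1)

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]

def affine(W, x, b):
    """z = Wx + b for a weight matrix W (list of rows) and vectors x, b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def init(n_in, n_out):
    W = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

sizes = [2, 3, 3, 3, 1]  # input, three hidden layers of three neurons, one output
layers = [init(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

a = [0.5, -1.0]            # an input vector x
activations = [a]
for W, b in layers:
    a = sigmoid_vec(affine(W, a, b))
    activations.append(a)  # stored for the backward pass

yhat = a[0]
print([len(v) for v in activations])  # [2, 3, 3, 3, 1]
```

Backpropagation then walks the `activations` list in reverse, applying the chain rule layer by layer exactly as in the derivation above.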

That’s it!

CHAPTER 11

The Eyes of a Machine

Almost all accounts of the history of deep neural networks for computer vision acknowledge the seminal work done by neurophysiologists David Hubel and Torsten Wiesel, co-founders of the Department of Neurobiology at Harvard in the early 1960s and joint winners of the 1981 Nobel Prize in Physiology or Medicine. The Nobel was for their work on the cat’s visual system; it was awarded about fifteen years after their most pioneering work, and even in that intervening decade and a half, the two remained astonishingly productive. In 1982, the British vision scientist Horace Barlow wrote about Hubel and Wiesel’s Nobel: “By now the award must be considered, not only one of the most richly-deserved, but also one of the hardest-earned.”

Hubel and Wiesel’s early work involved creating a map of the visual cortex by recording electrical activity from individual neurons in the brains of cats, while they showed the cats visual patterns using, of all things, a slide projector. This matter-of-fact description obscures the painstaking nature of the experiments. The detailed description of these experiments is not for the queasy. The work has its roots in Hubel’s invention, in 1957, of a tungsten electrode to record the electrical activity of single neurons, or units, in the brain. This was a pioneering effort in itself. Until then, the most popular device for this purpose was an electrolyte-filled glass micropipette with a sharp tip that could be inserted into the brain. Hubel wanted something that wouldn’t break if the animal moved. Others had developed steel electrodes, but they weren’t stiff enough. Tungsten proved perfect. “The electrode has been used for recording single units for periods of the order of 1 hour from [the] cerebral cortex in chronic waking cats restrained only by a chest harness,” Hubel wrote.

It’s these electrodes that Hubel and Wiesel used to record the activity of single neurons in anesthetized cats. The cats were given the anesthetic intraperitoneal thiopental sodium and were kept under its influence throughout the experiment. (If electrocorticogram recordings showed that the effect was wearing off, the cat was dosed again.) With the electrodes in place, Hubel and Wiesel had to present visual stimuli to the anesthetized cat. They kept the cat’s eyelids open with wire clips, its pupils dilated using 1 percent atropine (a nerve agent), and injected succinylcholine to immobilize the muscles in the eyes. Succinylcholine is a muscle relaxant, so now the cat couldn’t breathe on its own, which “made it necessary to use artificial respiration.” The scientists even used contact lenses with lubricants to keep the cat’s eyes from “drying and becoming cloudy.” With this complicated setup in place, Hubel and Wiesel used electrodes to study, for hours on end, the activity of hundreds of individual neurons in the cat’s primary visual cortex as the cat’s eyes were exposed to patterns projected onto a screen using a tungsten filament projector.

Would such experiments be allowed today? It’s hard to say. Even by the 1980s, debates over the questionable ethics of such experiments were playing out in the op-ed pages of newspapers. A 1983 essay in The New York Times referred to a follow-up experiment Hubel and Wiesel did, which involved studying the development of vision in kittens. “At Harvard University, kittens were blinded by having their eyes sewn shut,” wrote essayist Steven Zak, at the time a law student “specializing in animals and the law.” This assertion drew a sharp response from a reader sympathetic to Hubel and Wiesel: “Among other things, their work has led to new ophthalmological procedures that will prevent blindness in thousands of children. There is ample evidence that the animals in those experiments received humane, decent care and did not suffer severe pain.”

Hubel and Wiesel’s findings certainly revolutionized our understanding of vision and, as the long arc of science shows, eventually impacted the design of deep neural network–based computer vision systems. It’s the latter that concerns us.

But first, a note about the stroke of luck that made their findings possible. Initially, Hubel and Wiesel could not get the cat’s cortical neurons to fire in response to visual stimuli, no matter what they showed the cat. Nothing worked. Then, in a classic case of scientific serendipity, it happened. A neuron fired, producing the kind of staccato sounds associated with a Geiger counter. After some sleuthing, Hubel and Wiesel figured out why. They realized that the cell was firing when they were changing slides on their projector: The neuron fired only when the edge of the slide was oriented at a particular angle as it was being moved and the edge was being projected onto the screen. The information on the slide (in this case, a black dot) wasn’t important. It was a particular orientation of the faint edge of the slide moving across the cat’s visual field that was triggering the neuron. Hubel and Wiesel had discovered an edge-detecting cell.

Hubel and Wiesel argued that there’s a hierarchy in the way information is processed in the visual cortex (a claim that has turned out to be not as clear-cut as in their initial account, but we’ll stick to it because it’s a view that has influenced AI enormously).

We need some definitions to appreciate their argument:

  • The “visual field” is the region in front of us that our eyes are sensitive to at any instant, when the eyes are focused on something in front—the “focused” part is important. Otherwise, simply moving one’s eyes would change the visual field. A stimulus in the visual field triggers a neural response.
  • The “receptive field” refers to the portion of the visual field that triggers a single neuron. The size of the receptive field of a neuron can vary from the very small to large, as we’ll soon see. If there’s an appropriate stimulus in the receptive field of a neuron, that neuron will fire. The neurons with the smallest receptive field, meaning those directly monitoring the image on the retina, are called retinal ganglion cells, the first layer of neurons that receive inputs from the retina.

Now let’s say there’s some stimulus in the form of a vertical line that appears in the visual field and overlaps the receptive fields of four retinal ganglion cells. Each cell will fire in response to the signal in its respective receptive field. These four cells—four is only an example—connect to a “simple cell,” a neuron that fires only when all four retinal ganglion cells (RGCs) fire together. Imagine a scenario where only one of the four RGCs fires. It has detected a stimulus. It signals the simple cell, but because the other three RGCs are quiet, the simple cell does not fire. If all four RGCs fire, then the simple cell fires, indicating the detection of a vertical edge. (This is reminiscent of the threshold activation function we encountered in earlier chapters: A single RGC input remains below some threshold needed to drive the simple cell, but inputs from multiple RGCs, arranged in the correct manner, exceed the threshold, causing the simple cell to fire.)

What if the edge were at an angle? Well, there are simple cells that fire for an edge that’s at some angle to the vertical. For example, here’s one that detects an edge at a 45-degree angle:

This leads us to a discussion about something incredibly interesting about vision: invariance. Think about a vertical edge. There might be many vertical edges in the visual field and a simple cell to detect each of those edges. But what if we want a neuron to signal the presence of a vertical edge regardless of its position in a large receptive field? Here’s a solution: Many simple cells, each responding to a vertical edge in a different part of the overall visual field, feed their outputs to a complex cell that fires if any one of the simple cells fires. Now the complex cell is said to be invariant to translation: A vertical edge can be anywhere in the receptive field of the complex cell, and as long as it triggers some simple cell, it’ll trigger a response from the complex cell.
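The pooling described here can be sketched in a few lines of Python. Taking the max over binary simple-cell outputs is one way, an assumption on my part, to implement the “fires if any one fires” rule:

```python
def complex_cell(simple_cell_outputs):
    # Fires if any constituent simple cell fires: a logical OR,
    # implemented here as a max over binary (0/1) outputs.
    return max(simple_cell_outputs)

# A vertical edge anywhere in the larger receptive field triggers the cell.
print(complex_cell([0, 0, 1, 0]))  # 1
print(complex_cell([0, 0, 0, 0]))  # 0
```

This max-over-positions is, in essence, what modern convolutional networks call max pooling.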

Similarly, you can have a complex cell that responds to an edge that’s at 30 degrees to the vertical anywhere in its receptive field. These are examples of spatial, or translational, invariance.

Something interesting happens to the receptive fields of neurons as we move up the hierarchy. The retinal ganglion cells have small receptive fields, and the RGCs fire only when there’s some stimulus in their tiny patch of the visual field (the small circles in our illustration). But now consider an edge-detecting simple cell. Its receptive field is much larger (composed of the receptive fields of, say, four RGCs aligned along a line). Only when there’s an edgelike stimulus that spans the entirety of that larger receptive field does the simple cell fire. Let’s move up one level in the hierarchy. Take the complex cell that fires in response to the presence of a vertical edge in a spatially invariant manner: It has a much larger receptive field that spans the receptive fields of the constituent edge-detecting simple cells; the complex cell fires when there’s a vertical edge anywhere in that larger receptive field.

Another important type of invariance, and one that’s easy to illustrate, is rotation invariance. Here’s an example in which complex cells fire when there’s an edge of a given orientation in the receptive field. These complex cells feed their outputs to a hypercomplex cell, which fires when any one of the complex cells fires. What we have now is a cell that’s invariant to rotation; as long as there’s an edge in the receptive field of the hypercomplex cell, the cell will fire, regardless of the orientation of the edge.

Hypercomplex cells are also known to fire maximally for an edge of some particular length; shorter or longer edges don’t have the same effect. You can combine such hypercomplex cells to detect, say, a chevron, or V-shaped, pattern. The following figure shows only the hierarchical hypercomplex cells; there would be simple and complex cells preceding the first layer of hypercomplex cells.

You can imagine combining these to get to cells higher up in the hierarchy that detect, say, a triangle or a square; and cells that are translation or rotation invariant to the presence of triangles or squares. And we can go on, imagining cells that are sensitive to different shapes; that are rotation and translation invariant; that are invariant to stretching, to lighting conditions, and so on. We can let the shapes get more and more complex until we end up with some “grandmother” cell that fires when you see…well, your grandmother (the latter bit about grandmother detecting cells is some far-fetched neuroscience lore).

That’s a neat story, anyway, but actual brain circuitry is messier and way more complicated. Still, this neat story is what led to the first artificial neural network based on Hubel and Wiesel’s work.

THE NEOCOGNITRON

Before the advent of deep neural networks, computer scientists took a particularly tedious approach to endowing machines with vision. First, they’d define the kinds of features you’d expect to see in an image (lines, curves, corners, color, etc.). The software would analyze an image to detect these features. It’d have to be invariant to various situations, such as those we’ve just encountered. The next layer of the software would then create a dictionary of objects: each object would consist of some set of features. Recognizing an object in another image involved detecting a significant number of a given set of features that defined that object. Such approaches proved computationally complex, given that the dictionary could keep getting bigger and bigger.

And yet our brains do it incredibly successfully. How?

In 1975, Kunihiko Fukushima of the NHK Science and Technology Research Laboratories in Tokyo announced the first real neural network–based image recognition system of considerable complexity in a paper titled “Cognitron: A Self-Organizing Multilayered Neural Network.” The learning algorithm—which adjusts the weights between neurons—took a Hebbian approach. Let’s say neuron x is connected to neuron y, where x is the pre-synaptic neuron and y is the post-synaptic neuron. (In biology, a synapse refers to a connection through which electrochemical signals flow between neurons.) Let’s say that the neural network is given some input that cascades through the network, and the network produces an output, right or wrong. For our pair of neurons, x and y, the algorithm looks to see if x fired. If the answer is “yes,” then it checks to see if y fired. If the answer is also “yes,” it checks to see if y fired more strongly than other proximal post-synaptic neurons. If all these conditions are met, the algorithm strengthens the synaptic connection between x and y; this is akin to correcting the weights of connections. Keep doing this over and over for a range of inputs, such as the letters “X,” “Y,” “T,” and “Z,” and the network’s connections stabilize and the output layer develops a distinct pattern of activation for each input.
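The learning rule just described might be sketched like this. This is a loose paraphrase in Python; the variable names, firing rates, and learning-rate value are illustrative assumptions of mine, not Fukushima’s actual formulation:

```python
def hebbian_update(w, x_fired, y_fired, y_rate, neighbor_rates, lr=0.1):
    # Strengthen the x->y connection only if x fired, y fired, and
    # y fired more strongly than its proximal post-synaptic neighbors.
    if x_fired and y_fired and all(y_rate > r for r in neighbor_rates):
        w += lr
    return w

w = hebbian_update(0.0, x_fired=True, y_fired=True,
                   y_rate=0.9, neighbor_rates=[0.4, 0.6])
print(w)  # 0.1: all conditions met, so the synapse is strengthened
```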

But as Fukushima wrote in a subsequent paper, published in 1980, the cognitron’s “response was dependent upon the position of the stimulus patterns.” If the same pattern appeared in different positions in the visual field, the cognitron recognized them as different patterns. In other words, the cognitron wasn’t translation invariant, let alone invariant in other, more complex ways.

In the 1980 paper, “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Fukushima introduced the neocognitron and paid homage to Hubel and Wiesel by adopting an architecture that was clearly inspired by their work. (For example, the neocognitron has S-cells and C-cells, meant to model simple and complex cells.) Each layer of the neocognitron has S-cells, which respond to some feature (say, a vertical edge). A number of such S-cells in a layer (which, taken together, are looking at some patch of the visual field) feed into a C-cell. If the C-cell fires, then it is indicating that there’s a vertical edge in that patch of the visual field. There are numerous such C-cells in that layer, each responding to a vertical edge in a different patch. The outputs from all the C-cells in one layer would then serve as inputs to an S-cell in the next layer. So, if the C-cells in one layer are each responding to the presence of a vertical edge in some part of the visual field, then the S-cell in the next layer, which is collating all this information, will respond to the presence of an edge anywhere in the overall visual field. With this arrangement, we get vertical edge detection with translation invariance.
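A highly simplified one-dimensional sketch of this S-cell/C-cell arrangement, assuming binary pixels and exact template matching (both simplifications of mine):

```python
def s_cell(patch, template):
    # An S-cell fires when its small patch matches its feature template.
    return 1 if patch == template else 0

def c_cell(field, template, width=2):
    # A C-cell pools (ORs) the S-cells applied at every position, so the
    # feature is detected regardless of where it sits in the field.
    return max(s_cell(field[i:i + width], template)
               for i in range(len(field) - width + 1))

# The feature [1, 1] is detected whether it appears on the left or the right.
print(c_cell([1, 1, 0, 0, 0], [1, 1]))  # 1
print(c_cell([0, 0, 0, 1, 1], [1, 1]))  # 1
print(c_cell([0, 1, 0, 1, 0], [1, 1]))  # 0
```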

Using this architecture, Fukushima’s neocognitron could learn to detect patterns, even if those patterns were shifted in position or were distorted or squished. A number of such layers gave the neocognitron the ability to recognize digits, even if the digits were shifting positions on the visual field or were distorted to some acceptable extent. This was a big deal at the time. “One of the largest and long-standing difficulties in designing a pattern-recognizing machine has been the problem [of] how to cope with the shift in position and the distortion in shape of the input patterns. The neocognitron…gives a drastic solution to this difficulty,” Fukushima wrote.

Despite the advance, the neocognitron’s training algorithm, which adjusted only the weights of the S-cells, was cumbersome, finely tuned, somewhat bespoke. Then, about a decade later, the “kid in France,” Yann LeCun, now a young postdoc with Hinton in Toronto, solved the same problem, using a neural network architecture that became one of his signature contributions to AI: the convolutional neural network. The CNN was trained using the backpropagation algorithm, unlike the neocognitron. A few years after LeCun’s paper was published, he met Fukushima. “He told me that when he saw our paper, in Neural Computation, he and his students were shocked, because they were actually working on the same thing,” LeCun told me. Fukushima had been scooped.

THE LeNET

By now, Marvin Minsky and Seymour Papert must come across as the villains who derailed research into neural networks for a good part of a decade. So, it’s somewhat surprising that Papert is one of LeCun’s intellectual heroes. When LeCun was still a student studying electrical engineering in Paris, he chanced upon the book Language and Learning: The Debate Between Jean Piaget and Noam Chomsky. In October 1975, Piaget and Chomsky, two intellectual giants with divergent views on the nature of cognition, among other things, met at Royaumont Abbey, about thirty kilometers north of Paris. Alongside Piaget and Chomsky were other prominent thinkers, including Papert. One of the questions they were debating was whether our cognitive capabilities are mainly innate (Chomsky’s position) or whether they are the outcome of learning that happens during development, given some small nucleus of innate biological mechanisms (Piaget’s position). For example, one of Chomsky’s basic arguments about language is that most syntactic structures in language are innate, not learned. Piaget thought otherwise. Papert, during the debate, was in Piaget’s camp. He felt Chomsky wasn’t giving learning its due. “I believe that Chomsky is biased toward perceiving certain syntactic structures as ‘unlearnable’ because his underlying paradigm of the process of learning is too simple, too restricted. If the only learning processes were those he seems to recognize, these syntactic structures might indeed have to be innate!”

Papert called for greater clarity about what it means to say something is innate. “I will do this by describing an automaton, a machine that we understand quite thoroughly, and asking questions about what is and what is not innate in the machine. If the question is unclear even in this ‘toy’ situation, how much more clarification does it need in the complex situation of human development?” he argued. “The machine in question is called a perceptron.”

And then Papert went on to describe Rosenblatt’s perceptron.

While still an undergrad, and before he had read the book about the Piaget-Chomsky debate and Papert’s arguments, LeCun had been bitten by the intelligence bug. “I was always fascinated by the mystery of intelligence,” he told me. “Maybe because I’m too stupid or too lazy, I always thought that human engineers would not be smart enough to conceive and design an intelligent machine. It will have to basically design itself through learning. I thought learning was an essential part of intelligence.”

Papert’s analysis, using perceptrons, was right along those lines. LeCun, however, had never heard of learning machines. He was fascinated. He began digging through academic literature and poring over library books. He read about perceptrons, and he read Minsky and Papert’s book. “I realized pretty early, because I read all those old papers, that everybody in the sixties was looking for a way to train multi-layer nets,” LeCun told me. “They knew that the limitation was due to the fact that [they] were stuck with linear classifiers.”

LeCun discovered and memorized part of the ML bible we have already encountered in previous chapters, Duda and Hart’s Pattern Recognition. His key takeaway from all his reading, he told me, was that a “learning algorithm should minimize an objective function. You can derive a whole bunch of stuff from that.”

An objective function is a slight but significant change to the loss function. We have already encountered the latter: It’s a function that takes in the ML model’s parameters and then calculates the loss, say, as the mean squared error (MSE) over the entire training dataset. We saw how the loss function can be minimized or optimized. Working only with the loss function comes with an inherent problem: Do too well on the optimization, and your ML model can overfit the data; it can literally memorize everything. In which case, it might perform badly when making predictions on test data it hasn’t seen before. To avoid this, one can add an extra term to the loss function, something called a regularizer. This term is designed to make the ML model avoid overfitting. The loss function and the regularizer taken together constitute the objective function. Minimizing the objective function, instead of just a pure loss function, leads to a model that can generalize better to unseen data.
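As a sketch, with MSE as the loss and an L2 penalty as the regularizer (L2 is a common choice; the text does not commit to a particular regularizer, and the function name and λ value here are my own):

```python
def objective(weights, preds, targets, lam=0.01):
    # Objective = MSE loss over the dataset + L2 regularization term.
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    l2 = lam * sum(w ** 2 for w in weights)
    return mse + l2

# Larger weights are penalized, nudging the model away from overfitting.
print(objective(weights=[2.0], preds=[1.0, 2.0], targets=[1.5, 1.5]))
```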

With minimizing objective functions for multi-layer neural networks on his mind, LeCun started his Ph.D. and learned about Fukushima’s neocognitron. As part of his doctoral work, LeCun developed a learning algorithm that, he would realize, was related to the backpropagation algorithm (which we saw in the previous chapter). Instead of backpropagating gradients, or calculating all the partial derivatives using the chain rule, LeCun’s algorithm backpropagated “virtual target values” for each hidden unit. The algorithm could then calculate an error for each unit and the requisite gradient to perform an update. Under special conditions, the algorithm behaves like backpropagation. While getting his Ph.D., LeCun began thinking about neural networks for invariant image recognition (of the kind we just saw).

He presented a paper on his learning algorithm at a conference in France in 1985. “It was [a] badly written paper in French,” he told me. Hinton, who was the keynote speaker at the conference, sought out LeCun, and the two hit it off, even completing each other’s sentences, LeCun recalled. Hinton invited LeCun to come to a summer school he was organizing in 1986 at Carnegie Mellon University, in Pittsburgh. There, Hinton told LeCun that he was moving to the University of Toronto; he asked if LeCun would join him there as his postdoc. “I said, ‘Of course.’ ”

LeCun finished his Ph.D. in 1987 and moved to Toronto, where he enjoyed an intellectually stimulating time with Hinton, the two having conversations “that I pretty much only had with him.” It was in Toronto that LeCun began working on convolutional neural networks for image recognition, or conv nets. (We’ll come to what “convolution” means in a minute.) These days, if a software engineer wants to implement a conv net, they can do so with fewer than a hundred lines of code, thanks to software packages such as PyTorch and TensorFlow. No such software existed in the mid-1980s. LeCun and a fellow doctoral student, named Léon Bottou, had to write special software to simulate neural networks. Called SN, it would eventually become Lush, one of the ancestors of the modern-day PyTorch. But in 1987, SN was a huge deal. “That [gave] us superpowers. Nobody else had anything like this,” LeCun told me. “It really was very instrumental in building the first conv net.”

Within a year of being at Toronto, LeCun was recruited by Bell Labs in Holmdel, New Jersey, where he joined an illustrious group led by Larry Jackel. At Bell Labs, LeCun got access to a large, intriguing dataset: images of handwritten digits from the U.S. Postal Service. The USPS was interested in automating the process for recognizing zip codes. LeCun coded a neural net to recognize these handwritten digits. PCs were still not fast enough to run such computer-intensive software, so he wrote a compiler using the Lisp programming language, which would take in the definition (or architecture) of the neural network to be implemented and spit out code in the C programming language. A C compiler would then turn this code to low-level instructions that could run on a hardware digital signal processor.

Meanwhile, a colleague, Donnie Henderson, put together a demo that used a video camera to turn handwritten digits scribbled on a piece of paper into digital images for the neural network to recognize. All this work happened in the space of a few months after LeCun got to Bell Labs. He recalled the experience of watching his neural network recognize hand-drawn digits: “It’s not that I had any doubt it would work. I was young and fearless,” he told me. Nonetheless, he was “absolutely elated.” The work resulted in a couple of papers, including the one in Neural Computation that shocked, and scooped, Fukushima’s neocognitron team. The outcome was a convolutional neural network that today is called LeNet; all modern-day CNNs are its descendants.

DOING THE CONVOLUTION

At the heart of even the most sophisticated convolutional neural network today is a very basic operation: the convolution. The term comes from a special operation that can be performed using two functions, say, f(x) and g(x):

f(x) * g(x), where “*” is the convolution operator.

We aren’t concerned with the general case of convolutions of functions but, rather, with a very specific 2D case that’s applicable to images. Let’s say you have a 5×5 image. Convolution in the context of an image is the operation performed on the image using another, smaller—say, 2×2—image, which is called a kernel or kernel filter. Here’s an example of such an image and kernel:

The process of convolution starts by our placing the 2×2 kernel on the top-left corner of the 5×5 image; this gives us four overlapping pixels. Multiply each pixel of the kernel with the value of the pixel lying beneath it. You get four numbers. Sum them up. The sum gives you the pixel value for a new image, at location [1, 1]. (There’s a formula for the size of the new image, but let’s leave that aside for now and take it that it’s a 4×4 image.) Here’s the first operation:

Now slide the kernel one pixel to the right. (The exact amount by which the kernel slides to the right can be changed, but we can use one pixel without any loss of generality.) Again, we have four overlapping pixels. Multiply each pair of overlapping pixels and sum them up. Now the sum gives you the pixel value for the new image at location [1, 2].

Keep sliding the kernel one pixel to the right, generating a new pixel value for the new image, until you cannot slide the kernel to the right any farther. For our example 5×5 image and 2×2 kernel, we can generate only four pixels as we go from left to right.

Once the kernel hits the end of the image on the right, we go back to the left and slide down one row and repeat the entire process. This generates the pixel values for the second row of the new image.

You get the picture. We do this until the kernel cannot move right any farther and then we go back to the left, slide the kernel down one pixel, and continue until we reach the bottom right of the main image and there’s no place left for the kernel to go. In this example, we generate a new 4×4 image. Have a go at filling up the rest of the blank spaces in the new image.
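The sliding-and-summing procedure above can be written directly in Python, with plain lists and no libraries. The particular image values here are made up for illustration:

```python
def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image; each placement's elementwise
    # product-sum becomes one pixel of the new, smaller output image.
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    return [[sum(kernel[a][b] * image[i * stride + a][j * stride + b]
                 for a in range(k) for b in range(k))
             for j in range(out_size)]
            for i in range(out_size)]

image = [[1, 0, 0, 1, 1],
         [0, 1, 0, 0, 1],
         [0, 0, 1, 0, 1],
         [1, 0, 0, 1, 0],
         [1, 1, 0, 0, 1]]
kernel = [[1, 0],
          [0, 1]]
out = convolve2d(image, kernel)
print(len(out), len(out[0]))  # 4 4: a 5x5 image and a 2x2 kernel give a 4x4 output
```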

We’ve just convolved the 5×5 image with the 2×2 kernel. Before we go into more details about the convolution, the figures on the following page show an example of convolving 28×28 images of handwritten digits, with two different 3×3 kernels. (The first image is the original, and the following two are convolved images.)

Leaving aside the specifics of the kernels, do you notice anything in particular about the new images? First, what’s not obvious is that the new images are 26×26 (the convolution operation reduced the size). But more important, what’s visually obvious, especially if you squint at the images, is that in the first convolved image, the horizontal line of the digit 4 is highlighted; whereas in the second convolved image, the vertical lines are highlighted. You can see the same effect, in a more pronounced manner, when the convolution operation is performed on the digit 1. (Again, the leftmost image is the original, followed by the two convolved images.)

The kernels were chosen specifically to achieve these highlights. The two kernels are:

These are called Prewitt kernels, after their developer. These kernels succeed in generating new images, after the convolution, that detect horizontal and vertical edges. Keep in mind, for now, that these are hand-designed kernels. LeCun wanted his neural network to learn such kernels.
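For reference, the standard 3×3 Prewitt operators look like this (sign conventions vary across sources), and a single product-sum shows how they respond to an edge:

```python
# The standard 3x3 Prewitt operators (sign conventions vary by source).
prewitt_horizontal = [[ 1,  1,  1],
                      [ 0,  0,  0],
                      [-1, -1, -1]]
prewitt_vertical   = [[ 1, 0, -1],
                      [ 1, 0, -1],
                      [ 1, 0, -1]]

def response(patch, kernel):
    # Product-sum of a 3x3 patch with a kernel: one step of the convolution.
    return sum(kernel[i][j] * patch[i][j] for i in range(3) for j in range(3))

# A patch containing a horizontal edge (bright above, dark below).
edge = [[1, 1, 1],
        [0, 0, 0],
        [0, 0, 0]]
print(response(edge, prewitt_horizontal))  # 3: strong response
print(response(edge, prewitt_vertical))    # 0: no response
```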

While doing the convolution, we made some assumptions. The first is that the kernel moved right or down by one pixel. The number of pixels by which the kernel moves is called its stride. We had a stride of 1. We could have chosen 2, and we’d have generated a new image of a different size. The choice of stride dictates the size of the new image; so, for an input image size i, kernel filter size k, and stride s, the output image size is given by:

⌊((i - k)/s) + 1⌋

(The floor brackets around some number, denoted by ⌊some number⌋, evaluate to the largest integer that’s less than or equal to the number inside the brackets. So, the floor of 4.3, given by ⌊4.3⌋, is 4.)
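The formula translates directly into code, and a quick check reproduces both output sizes used in this chapter:

```python
from math import floor

def output_size(i, k, s):
    # Output image size: floor((i - k) / s) + 1
    return floor((i - k) / s) + 1

print(output_size(5, 2, 1))   # 4: the 5x5 image with a 2x2 kernel, stride 1
print(output_size(28, 3, 1))  # 26: the 28x28 digit images with a 3x3 kernel
```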

We are making some other assumptions here: The input image is square (i.e., image width equals height), and we are not adding dummy pixels, called padding, around the input image, which is often done in convolutional networks. For 28×28 images, a 3×3 kernel, and a stride of 1, we get 26×26 images as output.

It’s time to connect these ideas to neurons, their weights, the idea of receptive fields, and so on. Consider a single operation, when the kernel is atop some portion of the image (see figure below):

The pixels in the main image have values [x11, x12, x21, x22]. The kernel’s pixels have values [w11, w12, w21, w22]. Here’s the result of multiplying the kernel with the underlying pixels:

w11·x11 + w12·x12 + w21·x21 + w22·x22

This must remind you of the operation done by an artificial neuron. The weights of the neuron are the values of the individual elements of the kernel. The inputs to the neuron are the pixel values of that part of the main image over which the kernel is hovering. The output of the neuron is simply the weighted sum of those pixels.

So, for every position the kernel takes atop the image, we assign one neuron. In our example, for a 5×5 image and a 2×2 kernel with a stride of 1, we need 16 such neurons. The outputs of these neurons give us a 4×4 image.

Here’s how this connects to the idea of receptive fields. Each neuron is paying attention only to a particular part of the image: either the four pixels in the top-left or top-right corner or in the bottom-left or bottom-right corner or the four pixels anywhere in the middle. Each neuron has its own region of interest in the image, and that is its receptive field. It’ll respond only to those pixels.

The output of a neuron will depend, of course, not just on the values of the pixels in its receptive field, but also on its weights, or the elements of the kernel matrix. We saw two examples of kernels, one that causes an output when there’s a horizontal edge and the other when there’s a vertical edge, in the receptive fields of the neurons.

For our example of a 5×5 image and a 2×2 kernel, we have a layer of 16 neurons, the outputs of which form a 4×4 image. The 16 neurons form the first hidden convolution layer of a neural network. These neurons, which share the same set of values for their weights, are like simple cells: Each responds to some pattern in its receptive field.

Now imagine taking the 4×4 image obtained after the first convolution and applying another convolution using a different 2×2 kernel. The output will be a 3×3 image. This will require 9 neurons. This is the second hidden convolution layer. Each neuron in this layer is the equivalent of a complex cell in Hubel and Wiesel’s hierarchy. Each neuron in this layer is sensitive to the value of 4 pixels in the 4×4 image generated by the previous layer. But each pixel in the 4×4 image was the outcome of a neuron that was sensitive to 4 pixels in the previous, input image. So, the neuron in the second layer is effectively sensitive to 9 pixels in the input image. Why is the receptive field not equal to 4×4 = 16 pixels? A moment’s thought will give you the answer.
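
The answer can be checked with a short helper (our own, not from the book): stacking two 2×2, stride-1 kernels gives each second-layer neuron a 3×3 receptive field, because neighboring first-layer neurons see overlapping patches.

```python
def receptive_field(kernel_sizes, strides):
    """Side length of the input patch a neuron 'sees' after stacked conv layers.
    Each layer adds (k - 1) * jump pixels, where jump is the product of the
    strides of all earlier layers."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two 2x2 kernels, both with stride 1: the receptive field is 3x3 = 9 pixels,
# not 4x4 = 16, because adjacent first-layer neurons overlap in what they see.
print(receptive_field([2, 2], [1, 1]))  # 3
```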

在上图中,各个单元格中的数字指的是图像的行号和列号,而不是像素的值。因此,11 表示第 1 行、第 1 列;43 表示第 4 行、第 3 列;依此类推。因此,最终 3×3 图像中的像素 11 是内核对前一个图像中的四个像素(像素 11、12、21 和 22)进行操作的结果。但这四个像素又是将 2×2 内核滑过前一层四个图块的结果,共覆盖像素 11、12、13、21、22、23、31、32 和 33。

In the illustration above, the numbers in the individual cells refer to the row and column numbers of the image, not to the value of the pixels. So, 11 means row 1, column 1; and 43 means row 4, column 3; and so on. So, pixel 11 in the final 3×3 image is the outcome of a kernel operating on four pixels in the previous image (pixels 11, 12, 21, and 22). But those four pixels are the outcome of sliding the 2×2 kernel across four patches in the previous layer, covering pixels 11, 12, 13, 21, 22, 23, 31, 32, and 33.

如果我们只取原始图像的 3×3 块(以粗线突出显示),以下是神经网络中的连接,它们可以将此块转换为单个像素。首先,我们将像素排成一条直线,以便于可视化,然后将这些像素连接到各自的神经元。很明显,第一个隐藏层中的每个神经元只对 4 个像素做出反应。当然,整个层将有 16 个神经元;插图仅显示其中的 4 个。4 个神经元生成下一个 4×4 图像的 4 个像素。这些像素/输出成为下一层神经元的输入,然后为后续图像生成一个像素。

If we take just the 3×3 patch of the original image, highlighted in bold lines, here are the connections in a neural network that can transform that patch into a single pixel. First, we lay out the pixels in a straight line for easy visualization and, then, connect these pixels to their respective individual neurons. It’s clear how each neuron in the first hidden layer is only responding to four pixels. Of course, the full layer will have 16 neurons; the illustration shows only 4 of them. The 4 neurons generate 4 pixels of the next 4×4 image. These pixels/outputs become the input to the neuron in the next layer, which then produces one pixel for the subsequent image.

你能明白为什么核操作在这种情况下能给我们带来平移不变性吗?假设核检测到水平边缘。边缘可以位于图像中的任何位置。只要它落在一个或多个神经元的感受野内,我们就会从第一层中的至少一个神经元获得信号。垂直边缘也是如此。后续层可以对这些信号起作用。

Can you see why the kernel operation gives us, in this case, translation invariance? Let’s say the kernel detects a horizontal edge. The edge can be anywhere in the image. As long as it falls within the receptive field of one or more neurons, we’ll get a signal from at least one of the neurons in the first layer. The same goes for vertical edges. Layers that follow can act on these signals.
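
To make this concrete, here is a small sketch (the kernel values are our choice, not necessarily the book's): a horizontal-edge kernel produces a strong response from some first-layer neuron no matter where the edge sits.

```python
import numpy as np

def convolve2d(image, kernel):
    i, k = image.shape[0], kernel.shape[0]
    out = i - k + 1
    return np.array([[np.sum(image[r:r + k, c:c + k] * kernel)
                      for c in range(out)] for r in range(out)])

# A horizontal-edge kernel (our choice): it fires where a bright row of
# pixels sits directly above a dark row.
kernel = np.array([[ 1.0,  1.0],
                   [-1.0, -1.0]])

def image_with_edge(row):
    """5x5 image that is bright above `row` and dark from `row` down."""
    img = np.zeros((5, 5))
    img[:row, :] = 1.0
    return img

# No matter where the edge sits, some first-layer neuron responds strongly.
for row in (1, 2, 3):
    print(row, convolve2d(image_with_edge(row), kernel).max())  # max is 2.0
```

The peak response moves around in the feature map as the edge moves, but its strength stays the same, which is the translation invariance being described.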

这种架构让人想起 Hubel 和 Wiesel 假设存在于我们大脑中的层次结构。第一个隐藏层的简单细胞(即神经元)对简单特征作出反应。下一个隐藏层的复杂细胞对一组简单细胞的输出作出反应,因此它们对一些更简单特征的组合作出反应。这种层次结构可以一直延续,直到某个神经元因特征的持续组合而被触发,表明输入中存在(比如说)数字 1 或数字 4。

The architecture is reminiscent of the hierarchy that Hubel and Wiesel had posited exists in our brains. The simple cells, or neurons, of the first hidden layer respond to simple features. Complex cells in the next hidden layer respond to the outputs of a group of simple cells, so they are responding to some composition of the simpler features. This hierarchy can keep going until you have a neuron that fires because the continued composition of features indicates, say, the presence of the digit 1 or the digit 4 in the input.

我们使用的内核是由某个花了不少心思的人专门设计的。设计内核来检测简单特征(例如垂直或水平边缘)相对简单。但对于复杂的图像呢?应该寻找哪些特征来区分一幅图像和另一幅图像?我们如何设计这样的内核?

The kernels we used were specially designed by someone who put a lot of thought into them. And it’s relatively straightforward to design kernels to detect simple features, such as vertical or horizontal edges. But what about complex images? What features should one be looking for that would distinguish one image from another? And how do we design such kernels?

正是在此时,LeCun 的洞察力开始显现。人类不可能找出定义图像的所有无数特征,然后设计内核来突出这些特征。LeCun 意识到他可以训练神经网络来学习这些内核;毕竟,每个内核矩阵的元素都是单个神经元的权重。使用反向传播训练网络来执行某些任务,本质上会帮助网络找到合适的内核。

This is where LeCun’s insight came to the fore. There’s no way a human can figure out all the myriad features that define an image and then design kernels to highlight those features. LeCun realized he could train a neural network to learn these kernels; after all, the elements of each kernel matrix are the weights of individual neurons. Training a network using backpropagation to do some task would, in essence, help the network find the appropriate kernels.

在将所有内容整合在一起之前,我们需要了解卷积神经网络中另一个常用的操作。这个操作称为池化,池化有几种类型,但为了从概念上理解该过程,我们将重点介绍其中一种。它称为最大池化。

We need to understand one more commonly used operation in convolutional neural networks before we put all the pieces together. It’s an operation called pooling, of which there are a few types, but we’ll focus on one in order to understand the process conceptually. It’s called max pooling.

最大池化的基本思想是将过滤器(内核的另一个名称)放置在原始图像的某个部分上,然后简单地输出过滤器下方区域中的最大像素值。最大池化应用于经过一个卷积阶段生成的图像。这会进一步缩小图像的尺寸,并带来两大好处。首先,下一个卷积阶段所需的神经元数量减少了;其次,位于最大池化阶段之后的神经元的感受野进一步扩大,有助于实现平移不变性。

The basic idea behind max pooling is to place a filter (another name for a kernel) over some part of the original image and then simply spit out the largest pixel value in the region below the filter. Max pooling is applied to an image that has been produced by one stage of a convolution. This brings down the size of the image further and has two huge benefits. First, the number of neurons you require for the next stage of convolution is reduced; and second, the receptive field of a neuron that comes after the max pooling stage is increased even more, helping with translation invariance.

给定一个 4×4 图像(例如,在卷积后获得),使用 2×2 过滤器的最大池化如下所示。与卷积核不同,最大池化过滤器的移动方式使得没有重叠像素,因此步幅与内核大小相同。在我们的示例中,步幅为 2 个像素。

Given a 4×4 image (obtained, say, after a convolution), here’s what max pooling would look like with a 2×2 filter. Unlike the convolution kernel, the max pooling filter moves such that there are no overlapping pixels, so the stride is the same as the kernel size. In our example, the stride is 2 pixels.

同样,新图像的尺寸由一个简单的公式给出(假设图像为方形且没有填充):

Again, the size of the new image is given by a simple formula (assuming square images and no padding):

⌊(( i - k )/ s ) + 1⌋

⌊((i - k)/s) + 1⌋

该公式表明,对于输入图像大小 i = 4、池化滤波器大小 k = 2 和步幅 s = 2,输出图像大小为 2。因此,4×4 图像变成 2×2 图像。

The formula, for an input image size i = 4, pooling filter size k = 2, and stride s = 2, gives us an output image size of 2. So, a 4×4 image turns into a 2×2 image.
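
A minimal sketch of non-overlapping max pooling, using the formula above (the names are ours):

```python
import numpy as np

def max_pool(image, k=2):
    """Non-overlapping max pooling: the stride equals the filter size k."""
    i = image.shape[0]
    out = (i - k) // k + 1   # the floor of ((i - k)/s) + 1, with s = k
    return np.array([[image[r*k:(r + 1)*k, c*k:(c + 1)*k].max()
                      for c in range(out)] for r in range(out)])

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 1, 5, 6],
                  [2, 2, 7, 3]])
pooled = max_pool(image)
print(pooled)  # [[4 2]
               #  [2 7]]
```

Each 2×2 block of the input contributes a single pixel, its largest value, so the 4×4 image shrinks to 2×2.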

有了卷积神经网络的所有这些元素,我们可以组合一个 CNN 来识别手写数字。

With all these elements of a convolutional neural network in hand, we can put a CNN together to recognize handwritten digits.

区分特征

DISTINGUISHING FEATURES

正如 LeCun 所意识到的,卷积神经网络的强大之处在于其学习特征的能力。学习特征涉及学习内核的值,正如我们所见,这等同于学习一组神经元的权重。确保网络强大的一种方法是拥有大量这样的内核,每个内核都学习区分完成识别手写数字等任务所必需的不同特征。以下是一个简单的架构,它触及了卷积神经网络的概念核心,具有多个内核:

As LeCun realized, the power of a convolutional neural network lies in its ability to learn features. Learning a feature involves learning the values for a kernel, which, as we saw, is the same as learning the weights of a bunch of neurons. One way to ensure that your network is powerful is to have a large number of such kernels, each learning to distinguish a different feature that’s necessary to accomplish, say, the task of recognizing handwritten digits. What follows is a simple architecture that gets at the conceptual heart of a convolutional neural network, with multiple kernels:

灰度输入图像由五个不同的内核处理。上图隐含的意思是,卷积和最大池化是由一组组神经元完成的;为清晰起见,图中没有显示。反向传播可用于学习这些神经元的权重。每个卷积都会生成一个 24×24 的图像;我们有五个内核,因此,网络可能会学习在输入图像中寻找五个不同的特征。卷积之后是最大池化层。每个最大池化操作都会输出一个 12×12 图像,共有五幅这样的图像。(补充一点:最大池化层在训练期间没有需要学习的参数或权重,但我们需要跟踪前一层中最大像素值的位置,以便使用链式法则并反向传播梯度。)在卷积加最大池化组合的最后,是时候根据可能检测到的特征做出决策了。最大池化生成的所有图像的像素首尾相连,组成一个长度为 720(12×12×5)的向量。这些输入送入第一个全连接层(FC1)。假设 FC1 有 50 个神经元。每个神经元接收 720 个输入并产生一个输出。该图仅显示接收这些输入的顶部神经元。

The grayscale input image is processed by five different kernels. What’s implicit in the above figure is that convolutions and max pooling are being done by bunches of neurons; for clarity, they are not shown. Backpropagation can be used to learn the weights of these neurons. Each convolution generates a 24×24 image; we have five kernels, so, potentially, the network will learn to look for five different features in the input image. After convolutions comes the max pooling layer. The outputs of each max pooling operation generate a 12×12 image, and there are five of these. (A geeky aside: There is no parameter or weight to be learned for the max pooling layer during training, but we need to keep track of the location of the maximum pixel value in the previous layer, in order to use the chain rule and backpropagate the gradient.) At the end of the convolution-plus-max-pooling combo, it’s time to make decisions based on the features that might have been detected. The pixels of all the images generated by max pooling are laid end to end, making up a vector of length 720 (12×12×5). These inputs feed into the first fully connected layer (FC1). Let’s say FC1 has 50 neurons. Each neuron receives 720 inputs and produces an output. The figure shows only the top neuron receiving these inputs.
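
The shape bookkeeping in this architecture can be checked with a short sketch (assuming, as the 24×24 maps imply, a 28×28 input and 5×5 kernels; the random weights merely stand in for learned ones, and the ReLU is our choice of activation):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_output(i, k, s=1):
    """The floor of ((i - k)/s) + 1, for square images with no padding."""
    return (i - k) // s + 1

n_kernels, k_conv, k_pool = 5, 5, 2   # five 5x5 kernels, 2x2 max pooling

side = conv_output(28, k_conv)               # 24: each kernel makes a 24x24 map
pooled = conv_output(side, k_pool, k_pool)   # 12: after non-overlapping pooling
flat = n_kernels * pooled * pooled           # 5 * 12 * 12 = 720 inputs to FC1

W1 = rng.standard_normal((50, flat))   # FC1: 50 neurons, 720 inputs each
W2 = rng.standard_normal((10, 50))     # FC2: 10 neurons, one per digit

x = rng.standard_normal(flat)          # stand-in for the flattened pooled maps
scores = W2 @ np.maximum(W1 @ x, 0.0)  # ReLU after FC1, raw scores from FC2
print(side, pooled, flat, scores.shape)  # 24 12 720 (10,)
```

The point of the sketch is only the plumbing: every layer's output size follows mechanically from the sizes before it, ending in ten scores, one per digit.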

FC1 的 50 个神经元的输出被馈送到具有 10 个神经元的 FC2:因此,FC2 的每个神经元接收 50 个输入。最后,我们从 FC2 获得 10 个输出。为什么是 10 个?因为我们的任务是识别手写数字。这个想法是,如果输入数字是 0,那么第 0 个神经元的激发强度比其他所有神经元都强;如果数字是 1,那么 1 号神经元的激发强度最大;依此类推。

The outputs of the 50 neurons of FC1 feed into FC2, which has 10 neurons: So, each neuron of FC2 receives 50 inputs. Finally, we get 10 outputs out of FC2. Why 10? Well, our task is to recognize handwritten digits. The idea is that if the input digit is 0, then the zeroth neuron fires more strongly than all the others; if the digit is 1, then neuron number 1 fires the strongest; and so on.

如果任务是区分两幅图像,那么只有一个神经元的 FC2 就足够了。它可以为一种类型的图像输出 0,为另一种类型的图像输出 1;例如,0 表示猫,1 表示狗。你如何训练这样的网络?对于我们的手写数字数据集,我们已经有了标记数据——有人精心将每幅图像注释为 0、1 或 9 等等——因此,我们可以使用监督学习。向网络展示一张图片,并要求它预测数字。假设输入是数字 8。理想情况下,代表数字 8 的输出神经元应该比其他神经元更强烈地激发(这意味着,它输出的值应该明显大于 FC2 其他神经元产生的值)。未经训练的网络会随意激发。计算预期输出与网络实际输出之间的误差,然后利用该误差通过反向传播计算梯度。接着更新所有参与将输入图像转换为输出的神经元的权重。更新后的权重确保对于相同的输入,网络的误差比以前小一点。对所有图像反复执行此操作,直到网络的错误率低到可以接受的程度。如果我们计算训练数据集中所有图像的梯度并一次性更新权重,我们就是在进行梯度下降。如果我们每次通过网络时只使用图像的一个子集,我们就是在进行随机梯度下降——在损失景观中醉醺醺地走下去,直到达到足够好的最小值。

Had the task been to distinguish between two images, an FC2 with just one neuron would have sufficed. It could output 0 for one type of image and 1 for another type; say, 0 for cats and 1 for dogs. How do you train such a network? For our dataset of handwritten digits, we have labeled data—someone painstakingly annotated each image as being a 0 or a 1 or a 9 and so on—so, we can use supervised learning. Present the network with an image and ask it to predict the digit. Say, the input is the digit 8. Ideally, the output neuron representing the digit 8 should fire more strongly than the others (meaning, the value it outputs should be significantly greater than the values generated by the other neurons of FC2). An untrained network will fire willy-nilly. Calculate the error between what’s expected and what the network does, and then use this error to calculate the gradients, via backpropagation. Then update the weights of all the neurons that were involved in turning the input image into an output. The updated weights ensure that for the same input, the network’s error is a tiny bit less than before. Do this for all images, over and over, until the network’s error rate is acceptably low. If we calculate the gradients for all images in the training dataset and update the weights in one go, we are doing gradient descent. If we use only a subset of the images for each pass through the network, we are doing stochastic gradient descent—a drunken walk down the loss landscape to a good-enough minimum.
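
The difference between the two update schemes can be sketched on a toy problem (the task, model, and learning rate here are our own, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy supervised task: learn the weight w in y = 3x from labeled pairs (x, y).
X = rng.random(200)
Y = 3.0 * X

def grad(w, xb, yb):
    """Gradient of the mean squared error with respect to w."""
    return np.mean(2.0 * (w * xb - yb) * xb)

# Gradient descent: one update per pass, using every training example.
w = 0.0
for _ in range(100):
    w -= 0.5 * grad(w, X, Y)

# Stochastic gradient descent: many noisy updates, one minibatch at a time.
w_sgd = 0.0
for _ in range(100):
    batch = rng.choice(len(X), size=16)   # a random minibatch of 16 examples
    w_sgd -= 0.5 * grad(w_sgd, X[batch], Y[batch])

print(round(w, 3), round(w_sgd, 3))  # both settle near 3.0
```

The full-batch version takes one smooth step per pass; the minibatch version takes many noisier steps, the "drunken walk" of the text, yet both end up near the same minimum.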

到目前为止还没有说清楚的是,这种网络的设计者必须对网络参数做出一系列决策,这些参数在训练过程中不会学习到,但这些决策对网络的性能影响巨大。例如,卷积层和全连接层的神经元具有激活函数。激活函数的选择就是这样一个决定。唯一的条件是激活函数应该是可微的,或者至少是近似可微的,以便实现梯度的反向传播。

What’s been left unsaid so far is that a designer of such a network has to make a host of decisions about parameters of the network that are not learned during the training process, decisions that nonetheless influence the network’s performance immensely. For instance, the neurons of convolution layers and the fully connected layers have activation functions. The choice of activation function is one such decision. The only condition is that the activation function should be differentiable, or at least approximately so, to enable the backpropagation of gradients.
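
For example, the sigmoid is differentiable everywhere, while the ReLU (which appears in later networks such as AlexNet) is only "approximately so": it has a kink at zero, where one conventionally assigns a derivative of 0 or 1. A quick sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Smooth and differentiable everywhere."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Undefined exactly at 0; by convention we use 0 there."""
    return (np.asarray(x) > 0).astype(float)

xs = np.array([-2.0, 0.0, 2.0])
print(sigmoid_grad(xs))  # nonzero everywhere, but small for large |x|
print(relu_grad(xs))     # [0. 0. 1.]: good enough for backpropagation
```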

这些手工选择的参数,包括内核过滤器的大小和数量、最大池化过滤器的大小和数量、卷积层和最大池化层的数量(在上面的例子中我们只有一对;可以继续堆叠它们)、完全连接层的大小和数量、激活函数——所有这些都构成了所谓的超参数。微调超参数或找到正确的值本身就是一门艺术。至关重要的是,这些不是通过反向传播来学习的。

These hand-chosen parameters, including the size and number of kernel filters, the size and number of max pooling filters, the number of convolution and max pooling layers (we had only a single pair in the example above; one can keep stacking them), the size and number of fully connected layers, the activation functions—all these constitute so-called hyperparameters. Fine-tuning, or finding the right values for, the hyperparameters is an art unto itself. Crucially, these are not learned via backpropagation.

LeCun 的 LeNet 比我们的例子稍微复杂一些,但也不是太复杂;他让它发挥了作用。此外,它是一个深度神经网络,这意味着它有隐藏层。(在我们的例子中,输入和 FC2 之间的层是隐藏的。)NCR 公司使用 LeNet 为银行业读取和识别支票上的数字。“这是反向传播的少数几个应用之一,效果非常好,而且很有深度,”Hinton 告诉我。

LeCun’s LeNet was somewhat more complicated than our example, but not overly so; he made it work. Also, it was a deep neural network, meaning it had hidden layers. (The layers between the input and FC2, in our case, are hidden.) LeNet was used by NCR Corporation to read and recognize digits on checks for the banking industry. “That was one of the few applications of backprop that really worked pretty well, and was deep,” Hinton told me.

那是 20 世纪 90 年代初。尽管出现了 LeNet,但深度神经网络并未大获成功。部分原因是支持向量机的成功,它在当时的机器学习社区引起了轰动:SVM 易于理解,软件可用,并且对于当时的小型数据集,它们是理想的算法。而卷积神经网络对许多人来说仍然晦涩难懂。当然,当时没有通用软件可以用来构建这些 CNN。“你必须编写自己的深度学习框架,”LeCun 告诉我。“你无法从我们这里获得它,因为 AT&T 不允许我们分发开源软件。所以我们无法让人们重现我们的结果。因此,没有人这样做。”人们试图编写自己的深度学习框架。“你可以花一年时间编写一个软件来运行神经网络和卷积网络,有些人这样做了,”他说。

This was the early 1990s. Despite LeNet, deep neural networks didn’t hit the big time. Part of that was because of the success of support vector machines, which made a splash in the machine learning community at around the same time: SVMs were easy to understand, the software was available, and for small datasets of the time, they were an ideal algorithm. Convolutional neural networks, for their part, were still opaque and mysterious to many. And of course, no general-purpose software existed that one could leverage to build these CNNs. “You had to write your own deep learning framework,” LeCun told me. “You couldn’t get it from us, because AT&T would not let us distribute our software open source. So we could not enable people to reproduce our results. And as a consequence, nobody did.” People attempted to write their own deep learning frameworks. “You could start spending a year writing a piece of software to run neural nets and convolutional nets, and a few people did this,” he said.

与此同时,在整个 20 世纪 90 年代,LeCun 继续研究并提倡使用卷积神经网络进行图像识别,而不是更传统的技术。对于不需要强大神经网络的低分辨率图像,他的 CNN 表现优于其他算法。“我们在所有大型会议上发表了这些论文,但并没有产生太大影响,因为到那时,计算机视觉社区认为,‘哦,也许有一点不同,也许你的卷积网络工作正常。但我们会赶上我们的方法,’”他告诉我。

Meanwhile, throughout the 1990s, LeCun continued working on and advocating for convolutional neural networks for image recognition over more conventional techniques. For low-resolution images that didn’t need powerful neural networks, his CNNs were outperforming other algorithms. “We published those papers in all the big conferences, but it didn’t have a big impact, because by that time, the computer vision community was thinking, ‘Oh, maybe there’s a small difference, maybe your conv net works okay. But we are going to catch up with our methods,’ ” he told me.

此外,神经网络无法正常工作的情况也很有启发性。“有迹象表明存在规模问题,”他说。“当[图像]不是太大,神经网络也不是太大时,它们可以打败一切。”但高分辨率图像并非如此。识别高分辨率图像需要大型神经网络,而训练此类网络意味着必须进行大量数值运算,主要是以矩阵运算的形式。为了加快这一过程,大部分数值运算需要某种形式的并行计算,但 20 世纪 90 年代计算机的中央处理器 (CPU) 无法胜任这项任务。不过,救星已经出现:图形处理单元 (GPU),它最初是作为专用于渲染 3D 图形的片上硬件而设计的。

Also, the situations where neural networks weren’t working as well were revealing. “There were signs that there was an issue of scale,” he said. “When [images] weren’t too big and the neural nets weren’t too large, they were beating the hell out of everything.” But not so with high-resolution images. Recognizing high-res images required large neural networks, and training such networks meant having to crunch numbers, mainly in the form of matrix manipulations. To make the process go faster, much of this number crunching required a form of parallel computing, but the central processing units (CPUs) of computers of the 1990s weren’t up to the task. However, saviors were on the horizon in the form of graphical processing units (GPUs), which were originally designed as hardware-on-a-chip dedicated to rendering 3D graphics.

GPU 被证明是改变深度学习面貌的关键。这种变化的最早迹象之一出现在 2010 年,来自 Jürgen Schmidhuber 及其同事,当时他们训练了多达九个隐藏层、约 1200 万个参数(即权重)的多层感知器,用于对 MNIST 图像进行分类。他们的错误率低至 0.35%。“要实现这一迄今最好的结果,我们只需要很多隐藏层、每层很多神经元、大量变形的训练图像,以及能够大大加快学习速度的显卡,”该团队写道。

GPUs proved central to changing the face of deep learning. One of the earliest indications of this change came in 2010, from Jürgen Schmidhuber and colleagues, when they trained multi-layer perceptrons with as many as nine hidden layers and about 12 million parameters or weights, to classify MNIST images. They achieved error rates as low as 0.35 percent. “All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning,” the team wrote.

但是,使用 GPU 克服相对较小的 MNIST 数据集所带来的挑战,还不足以说明这些处理器的强大功能。要了解它们对深度学习(尤其是 CNN)的真正影响,我们必须将焦点转移到多伦多的 Hinton 实验室。Hinton 和两名研究生 Alex Krizhevsky 和 Ilya Sutskever(Krizhevsky 是 GPU 编程高手,Sutskever 是一位有远见的人,看到了大型深度神经网络的潜力)在那里构建了第一个大型 CNN。以当时的标准衡量,这是一个庞大的神经网络,用海量的高分辨率图像训练而成。通过它,Krizhevsky 和 Sutskever 一劳永逸地证明了传统的图像识别方法永远无法赶上。该网络被称为 AlexNet。

But the use of GPUs to overcome the challenge posed by the relatively small MNIST dataset doesn’t begin to hint at the power of these processors. To understand their true impact, on deep learning in general and CNNs in particular, we have to shift focus to Hinton’s lab in Toronto, where Hinton and two graduate students, Alex Krizhevsky and Ilya Sutskever—Krizhevsky was a whiz at programming GPUs, and Sutskever was a visionary who saw the potential of large deep neural networks—built the first massive CNN. For its time, it was a gigantic neural network trained on high-res images and an immense number of them. With it, Krizhevsky and Sutskever showed once and for all that conventional methods for image recognition were never going to catch up. The network came to be called AlexNet.

AlexNet

ALEXNET

早在 AlexNet 问世之前,Hinton 和研究生 Volodymyr Mnih 就了解 GPU 的用处。两人正在研究在航拍图像中寻找道路的问题。“在城市里,树木、停放的汽车和阴影密布,要找到道路并非易事,”Hinton 告诉我。但他们意识到,他们有充足的数据来帮助自己:可以访问其他类型的地图,这些地图上的道路标记清晰。这些地图被称为矢量地图,每张地图都存储为点、线和多边形的集合。与图像不同,矢量地图是使用存储的信息按需绘制的。Hinton 和 Mnih 使用这些矢量地图中的信息来教神经网络如何适当地标记航拍图像的像素。(例如,某个像素是否属于道路?)这需要一个大型神经网络。(但它不是 CNN;团队决定不使用 CNN,因为担心池化层会破坏单个像素级别的空间位置信息。)大型网络需要 GPU;到那时,这些 GPU 已配备名为 CUDA 的软件,这是一种编程接口,允许工程师将 GPU 用于图形加速器之外的通用任务。Mnih 在 CUDA 之上编写了另一个软件包,名为 CUDAMat,以“轻松在支持 CUDA 的 GPU 上执行基本矩阵计算”。

Even before the advent of AlexNet, Hinton and a graduate student named Volodymyr Mnih understood the usefulness of GPUs. The two were working on the problem of finding roads in aerial images. “In cities, where there are trees and parked cars and shadows, it’s not trivial to find the roads,” Hinton told me. But they realized they had ample data to help them: access to other types of maps in which roads were clearly marked. These were so-called vector maps, and each map was stored as a collection of points, lines, and polygons. Unlike an image, a vector map is drawn upon demand, using the stored information. Hinton and Mnih used information in these vector maps to teach neural networks how to appropriately label the pixels of an aerial image. (For example, does the pixel belong to a road or not?) This required a large neural network. (It wasn’t a CNN, though; the team decided against using CNNs because of concerns that a pooling layer would destroy information about spatial positions at the level of individual pixels.) The large network required GPUs; by then, these came equipped with software called CUDA, a programming interface that allowed engineers to use GPUs for general-purpose tasks beyond their intended use as graphics accelerators. Mnih wrote another package atop CUDA, called CUDAMat, to “make it easy to perform basic matrix calculations on CUDA-enabled GPUs.”

一年后,Hinton 的另外两名学生成功使用 CUDAMat 编程深度神经网络,在语音识别方面取得突破。GPU 显然对于释放这些网络的力量至关重要,但并不是每个人都意识到了这一点。Hinton 回忆说,他曾试图说服微软为一个共同项目购买 GPU,但微软拒绝了。Hinton 半开玩笑地告诉微软,他的团队买得起 GPU,因为他所在的是一所富有的加拿大大学,而微软是一家“穷困潦倒的公司”,因此“买不起”也是可以理解的。这种讽刺很有效。“他们买了 GPU,”Hinton 告诉我。“然后他们试图用微软的软件来运行它们。所以……”

A year later, two other students of Hinton’s successfully used CUDAMat to program deep neural networks to make breakthroughs in speech recognition. GPUs were obviously crucial for unleashing the power of these networks, but not everyone recognized this. Hinton recalls trying to persuade Microsoft to buy GPUs for a common project, but Microsoft balked. Hinton told the company, tongue in cheek, that his team could afford GPUs because he was at a rich Canadian university and that Microsoft was a “poor impoverished company,” so it was understandable that it “couldn’t afford them.” The sarcasm worked. “They bought GPUs,” Hinton told me. “And then they tried to run them with Microsoft software. So…”

讽刺的是,Hinton 早期在道路航拍图像项目中使用常规非卷积深度神经网络取得了成功,这推迟了他的实验室进军卷积神经网络的步伐。“我们只是有点慢,但 Ilya [Sutskever] 意识到我们必须这样做,”他说。

Ironically, Hinton’s early success at using regular nonconvolutional deep neural networks for the roads-in-aerial-images project delayed his lab’s foray into convolutional neural networks. “We were just a bit slow, but Ilya [Sutskever] realized we had to do it,” he said.

2002 年,年仅 17 岁的 Sutskever 进入多伦多大学。不到一年,他就决定从事人工智能研究。“我想为人工智能做贡献,”他告诉我,“神经网络似乎是正确的。”为此,当他还在读本科二年级时,他敲开了 Hinton 的门。Sutskever 记得自己对 Hinton 的态度并不特别得体或礼貌。Hinton 给了 Sutskever 一些论文阅读。年轻的 Sutskever 被这些简单想法的力量所震惊。“我确实记得自己对整个事情的简单性感到困惑……怎么会这样?”他说。“你看看你的本科数学或物理课程,它们太复杂了。而这些东西却如此简单。你只需阅读两篇论文,就能理解如此强大的概念。怎么会这么简单?”

In 2002, Sutskever, who was barely seventeen, joined the University of Toronto. Within a year, he had decided to work on AI. “I wanted to contribute to AI,” he told me. “Neural networks seemed obviously correct.” To that end, when he was still in his second year of undergraduate studies, he knocked on Hinton’s door. Sutskever remembers not being particularly tactful or polite toward Hinton. Hinton gave Sutskever some papers to read. The young Sutskever was taken aback by the power of the simple ideas. “I definitely remember being perplexed by how simple the whole thing is…How can it be?” he said. “You look at your undergrad classes in math or physics, and they’re so complicated. And then this stuff is so simple. You just read two papers and you understand such powerful concepts. How can it be that it’s so simple?”

Sutskever 拥有计算复杂性理论方面的背景,该理论研究计算机能做什么和不能做什么。“研究计算复杂性理论得出的一个结论是,某些计算模型比其他模型强大得多,”他说。“关于神经网络,有一点非常清楚,即它们符合强大计算模型的标准。它足够强大。”

Sutskever had a background in computational complexity theory, the study of what computers can and cannot do. “One of the things that comes up from looking at computational complexity theory is that certain computational models are much more powerful than others,” he said. “The thing that was very clear about neural networks is that [they] fit the bill of a powerful computational model. It was powerful enough.”

确实,一个大到足以考验神经网络的问题在 2009 年出现了。那一年,斯坦福大学教授李飞飞和她的学生在第一届计算机视觉与模式识别 (CVPR) 会议上发表了一篇论文,题为“ImageNet:大规模分层图像数据库”,其中包含数百万张手工标记图像的庞大数据集,这些图像分属数千个类别(按 2009 年的标准来看规模惊人)。2010 年,该团队发起了 ImageNet 挑战赛:使用 120 万张 ImageNet 图像(分为 1,000 个类别)训练计算机视觉系统正确分类这些图像,然后在 100,000 张从未见过的图像上进行测试,看看系统识别它们的能力如何。该竞赛当时刚刚起步,因此作为“品尝者竞赛”,与更为成熟的 2010 年 PASCAL 视觉对象类别挑战赛一同举行。

And indeed, a problem big enough to pose questions of neural networks appeared in 2009. That year, Stanford University professor Fei-Fei Li and her students presented a paper at the first Computer Vision and Pattern Recognition (CVPR) conference. Titled “ImageNet: A Large-Scale Hierarchical Image Database,” the paper included an immense dataset of millions of hand-labeled images consisting of thousands of categories (immense by the standards of 2009). In 2010, the team put out the ImageNet challenge: Use 1.2 million ImageNet images, binned into 1,000 categories, to train your computer vision system to correctly categorize those images, and then test it on 100,000 unseen images to see how well the system recognizes them. The contest was so new that it was conducted as a “taster competition” alongside a more established contest, the PASCAL Visual Object Classes Challenge 2010.

当时,标准计算机视觉仍占据主导地位。鉴于这一点,ImageNet 挑战赛为用户提供了所谓的尺度不变特征变换 (SIFT)。开发人员可以使用这些 SIFT 从图像中提取已知类型的低级特征,识别这些特征,并使用它们对图像进行分类。(神经网络可以自行找出重要特征,但当时并不在考虑范围内。)2010 年,NEC 和伊利诺伊大学香槟分校 (NEC-UIUC) 的一个团队赢得了挑战赛。他们的系统基本上使用 SIFT 将每幅图像转换为一个长向量。支持向量机学会了对这些向量进行分类,从而对图像进行分类。

Standard computer vision still ruled the roost then. In recognition of this, the ImageNet challenge provided users with so-called scale invariant feature transforms (SIFTs). Developers could use these SIFTs to extract known types of low-level features from images, recognize those features, and use them to categorize an image. (Neural networks, which could figure out the features of importance on their own, weren’t in the picture.) In 2010, a team from NEC and the University of Illinois Urbana-Champaign (NEC-UIUC) won the challenge. Their system essentially used the SIFTs to turn each image into a long vector. A support vector machine learned to categorize these vectors and, thus, classify the images.

与此同时,Sutskever 看到了不祥之兆。SVM 赢得了 2010 年的比赛,但在他看来,SVM 的潜力有限。神经网络才是未来。“如果你能搞清楚如何训练它们,如果你能获得数据,那么神经网络的上限就很高。而其他东西,比如支持向量机,无论你多么想研究它们,上限都很低……所以,你从一开始就注定要失败,”他告诉我。

Meanwhile, Sutskever saw the writing on the wall. SVMs had won the 2010 competition, but in his mind, they were limited. Neural networks were the future. “If you could figure out how to train them, if you could get the data, then the ceiling [for what neural networks could do] was high. Whereas the other stuff, [such as] support vector machines—it doesn’t matter how much you want to study them. The ceiling is low…So, you’re doomed from the get-go,” he told me.

突然间,数据不再是问题。ImageNet 数据集暂时解决了这个问题。然而,训练仍然是一个问题。Sutskever 看到了 Hinton 团队使用 GPU 所做的工作,把两者联系在一起,并鼓动 Hinton 构建一个可以使用 GPU 进行训练的卷积神经网络。“Ilya 是一个有远见的人。他有很好的直觉和很大的信心,”Hinton 告诉我。“是 Ilya 意识到我们使用的技术,加上 GPU 和这些新的学习算法,可以解决 ImageNet 问题。”贝尔实验室的 Yann LeCun 团队也意识到了这一点。“Yann 意识到了同样的事情。他试图让几个不同的研究生来做这件事,但他们都不愿意做,”Hinton 说。“这对我们来说是幸运的,因为 Ilya 意识到我们必须在别人之前做这件事。”

Data, suddenly, wasn’t an issue. The ImageNet dataset had solved that problem for the moment. Training, however, remained an issue. Sutskever saw the work Hinton’s team had done with GPUs, put two and two together, and pushed Hinton to build a convolutional neural network that could be trained using GPUs. “Ilya is a visionary. He’s just got wonderful intuitions and lots of confidence,” Hinton told me. “It was Ilya who realized that the technology we were using, with GPUs and these new learning algorithms, would just solve ImageNet.” Yann LeCun’s group at Bell Labs was also onto it. “Yann realized the same thing. And he tried to get several different graduate students to do it, but none of them was willing to work on it,” Hinton said. “And that was lucky for us, because Ilya realized we really had to do it before anybody else did it.”

至关重要的帮助来自 Alex Krizhevsky 对 GPU 的神奇技能。“他比任何人都能更好地在 GPU 上编写卷积,”Hinton 说。Sutskever 和 Krizhevsky 是实验室同事。Krizhevsky 已经编写了 CUDA 代码,用于在名为 CIFAR(加拿大高级研究院)的较小图像数据集上训练支持 GPU 的神经网络。Sutskever 对代码印象深刻。他说服 Krizhevsky 对 ImageNet 做同样的事情。

Crucial help came in the form of Alex Krizhevsky’s wizardry with GPUs. “He could program convolutions on GPUs better than anybody else,” Hinton said. Sutskever and Krizhevsky were lab mates. Krizhevsky had already written CUDA code to train GPU-enabled neural networks on a smaller image dataset called CIFAR (the Canadian Institute for Advanced Research). Sutskever was impressed by the code. He convinced Krizhevsky to do the same for ImageNet.

因此,借助 GPU 的强大功能和海量数据(这两点是 LeCun 在 1989 年所没有的),Krizhevsky、Sutskever 和 Hinton 构建了 AlexNet,这是一个深度卷积神经网络,使用来自 ImageNet 数据集、分属一千个类别的 120 万张高分辨率图像进行训练。它有五个卷积层,其中一些输入到最大池化层。它有两层全连接神经元,与我们之前的示例非常相似。最终的输出层有 1,000 个神经元,每个图像类别一个。该神经网络有超过 50 万个神经元和 6,000 万个参数(即权重),这些参数的值必须在训练过程中学习。还有其他规模较小但意义重大的技术进步(例如,为神经元选择了一种不同的激活函数,即整流线性单元 ReLU,而不是 S 形函数)。

So, with the power of GPUs and an immense amount of data—two things LeCun didn’t have in 1989— Krizhevsky, Sutskever, and Hinton built AlexNet, a deep convolutional neural network trained on 1.2 million high-res images from the ImageNet dataset, consisting of a thousand categories. There were five convolutional layers, some of which fed into max pooling layers. There were two fully connected layers of neurons, much like in our earlier example. The final output layer had a thousand neurons, one for each category of image. The neural network had more than half a million neurons and 60 million parameters, or weights, whose values had to be learned during training. There were other, smaller but significant technical advances (for instance, a choice of a different activation function for neurons, called a rectified linear unit, ReLU, instead of the sigmoid function).

2012 年,李飞飞的团队公布了他们年度图像识别竞赛的结果。这个多伦多团队的网络被恰当地称为 AlexNet,以巨大的优势获胜。AlexNet 可以对 ImageNet 测试数据集中的图像进行分类,前五错误率低至 17%(前五错误率是指图像的正确标签未出现在 ML 模型预测的前五个最可能标签中的次数百分比)。2010 年和 2011 年的冠军则落后很多,分别为 28% 和 26%。即使是 2012 年的亚军也落后了,为 26%。非神经网络系统几乎没有任何进步。深度神经网络终于兑现了它们的承诺。Sutskever 被证明是正确的,因为甚至在这三人开始研究 AlexNet 之前,他就一直在四处宣传。“我会到处走走,惹恼别人,”他告诉我。“我会告诉他们深度学习将改变一切。”

In 2012, Fei-Fei Li’s team announced the results of their annual image recognition contest. AlexNet, as the Toronto group’s network is rightfully called, won by a wide margin. AlexNet could classify images in the ImageNet test dataset with a top-5 error rate as low as 17 percent (the top-5 error rate refers to the percentage of times the correct label for an image does not appear in the top five most likely labels predicted by the ML model). The winners in 2010 and 2011 were way behind, at 28 and 26 percent, respectively. Even the runner-up in 2012 was lagging, at 26 percent. The non-neural network systems had barely moved the needle. Deep neural networks had finally lived up to their promise. Sutskever was vindicated, because even before the trio started working on AlexNet, he had been evangelizing. “I would go around and I would annoy people,” he told me. “I would tell them that deep learning is going to change everything.”

确实如此。AlexNet 只是一个开始。深度神经网络在各种任务上变得越来越大,越来越好:计算机视觉(包括面部和物体检测和识别等子专业的领域)、自然语言处理(允许机器用自己的类似人类的文本或语音响应人类生成的文本或语音)、机器翻译(将一种语言的文本翻译成另一种语言)、医学图像分析、金融数据中的模式检测等等。这个列表是无穷无尽的。

It has. AlexNet was just the start. Deep neural networks have gotten bigger and bigger, and better and better, at a wide range of tasks: computer vision (a field that subsumes subspecialties such as face and object detection and recognition), natural language processing (which allows machines to respond to human-generated text or voice with its own human-like text or voice response), machine translation (which takes text in one language and translates it into another), medical image analysis, pattern detection in financial data, and so much more. The list is endless.

从数学的角度来看,深度神经网络抛出了一个深奥的谜团:随着它们变得越来越大,标准机器学习理论一直难以解释这些网络为何能如此有效。加州大学圣地亚哥分校的 Mikhail Belkin 认为,深度神经网络正引领我们走向更全面的机器学习理论。他将机器学习研究的现状比作量子力学走向成熟时的物理学。“一切都被抛出了窗外,”他说。由于有关人工神经网络的经验数据,机器学习理论家也被迫接受类似的变革。Belkin 将经验主义者比作制图师:他们为理论家照亮了前方的地形。我们的最后一章将让我们一睹这片令人兴奋的新领域。

Viewed through our mathematical lens, deep neural networks have thrown up a profound mystery: As they have gotten bigger and bigger, standard ML theory has struggled to explain why these networks work as well as they do. Mikhail Belkin of the University of California, San Diego, thinks that deep neural networks are pointing us toward a more comprehensive theory of machine learning. He likens the situation in ML research to the time in physics when quantum mechanics came of age. “Everything went out of the window,” he said. Something similar is being forced upon ML theorists, thanks to empirical data about artificial neural networks. Belkin compares empiricists to cartographers: They are illuminating the terrain for theorists to follow. Our final chapter will give us a glimpse of this exciting new terrain.

第十二章

CHAPTER 12

未知之地

Terra Incognita

深度神经网络到达(几乎)没有任何 ML 算法到达过的地方

Deep Neural Networks Go Where (Almost) No ML Algorithm Has Gone Before

2020 年的某个时候,总部位于旧金山的人工智能公司 OpenAI 的研究人员正在训练一个深度神经网络,以学习如何将两个数字相加。这些数字是二进制的,加法以 97 为模,这意味着两个数字的和总是在 0 到 96 之间。如果和超过 96,它就会回绕,就像钟面上的数字回绕一样。通过查看示例可以最好地说明这个想法。因此,两个数字的和可以写成:

Sometime in 2020, researchers at OpenAI, a San Francisco–based artificial intelligence company, were training a deep neural network to learn, among other things, how to add two numbers. The numbers were in binary, and the addition was modulo-97, meaning any sum of two numbers would always be between 0 and 96. If the sum exceeded 96, it’d wrap around, the way numbers wrap around on a clock face. The idea is best illustrated by looking at examples. So, any sum of two numbers can be written as:

总和= x + (97 的某个倍数),其中 0 ≤ x ≤ 96

sum = x + (some multiple of 97), where 0 ≤ x ≤ 96

因此,模 97 的和为:

So, the sum, modulo-97, is:

总和 mod 97 = x

sum mod 97 = x

例如,假设您要将 22 和 28 相加:

For example, let’s say you want to add 22 and 28:

总和= 22 + 28 = 50 + (0 × 97)

sum = 22 + 28 = 50 + (0 × 97)

总和 mod 97 = 50

sum mod 97 = 50

或者,你想将 40 和 59 相加:

Or, you want to add 40 and 59:

总和= 40 + 59 = 99 = 2 + (1 × 97)

sum = 40 + 59 = 99 = 2 + (1 × 97)

总和 mod 97 = 2

sum mod 97 = 2
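The wrap-around arithmetic above is a single line of code. A minimal sketch (the function name is ours; the OpenAI network was, of course, fed binary representations rather than calling an operator):

```python
def add_mod97(a: int, b: int) -> int:
    """Add two numbers and wrap the result around at 97, like a clock with 97 positions."""
    return (a + b) % 97

print(add_mod97(22, 28))  # → 50, since 50 = 50 + (0 × 97)
print(add_mod97(40, 59))  # → 2, since 99 = 2 + (1 × 97)
```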

这看似一个微不足道的问题,但却是理解如何让人工智能进行分析推理的必要步骤。一名训练神经网络的团队成员去度假了,忘记停止训练算法。当他回来时,他惊讶地发现神经网络已经学会了加法的一般形式。就好像它对这个问题有了更深的理解,而不仅仅是记住它正在训练的数字集的答案。

It was a seemingly trivial problem, but a necessary step toward understanding how to get the AI to do analytical reasoning. A team member who was training the neural network went on vacation and forgot to stop the training algorithm. When he came back, he found to his astonishment that the neural network had learned a general form of the addition. It’s as if it had understood something deeper about the problem than simply memorizing answers for the sets of numbers on which it was being trained.

在偶然发现科学的悠久传统中,该团队偶然发现了深度神经网络的一个奇怪的新特性,他们称之为“grokking”,这个词是美国作家罗伯特·海因莱因在他的著作《异乡异客》中发明的。OpenAI 团队的一名成员 Alethea Power 告诉我:“Grokking 的目的不仅仅是理解,还要内化并成为信息。”他们的小型神经网络似乎已经理解了数据。

In the time-honored tradition of serendipitous scientific discoveries, the team had stumbled upon a strange, new property of deep neural networks that they called “grokking,” a word invented by the American author Robert Heinlein in his book Stranger in a Strange Land. “Grokking is meant to be about not just understanding, but kind of internalizing and becoming the information,” Alethea Power, a member of the team that did the work at OpenAI, told me. Their small neural network had seemingly grokked the data.

Grokking 只是深度神经网络表现出的众多奇怪行为之一。(我们将在本章后面更详细地讨论它。)另一个与这些网络的规模有关。当今最成功的神经网络,无论是进行图像或语音识别还是自然语言处理,都是庞然大物:它们有数亿或数十亿个权重(即参数),甚至可能达到一万亿。这些参数的数量有时可以等于甚至大大超过用于训练这些网络的数据实例数。标准机器学习理论认为,这样的网络不应该像它们实际那样起作用:它们应该只是过度拟合数据,而无法对新的、未见过的数据做出推断,即无法“泛化”。

Grokking is just one of many odd behaviors demonstrated by deep neural networks. (We’ll look at it in more detail later in this chapter.) Another has to do with the size of these networks. The most successful neural networks today, whether they are doing image or speech recognition or natural language processing, are behemoths: They have hundreds of millions or billions of weights, or parameters; maybe even a trillion. The parameters can at times equal or vastly outnumber the instances of data used to train these networks. Standard ML theory says that such networks shouldn’t work the way they do: They should simply overfit the data and fail to make inferences about, or “generalize to,” new, unseen data.

Power 用一个例子来说明这个问题。考虑一些家具类型的图像数据集。数据集分为两部分:训练和测试。假设训练数据集中有四条腿的椅子的图像,但只有那些由金属或木头制成的椅子。这些被标记为“椅子”。也有沙发的图像,但没有任何可辨别的腿。这些被标记为“非椅子”。机器学习算法的任务是将图像分类为“椅子”或“非椅子”。算法训练完成后,将根据测试数据集测试其性能。碰巧的是,测试数据集还包含由塑料制成的椅子,而不仅仅是木头和金属制成的椅子。算法的表现如何?

Power used an example to illustrate this problem. Consider a dataset of images of some types of furniture. The dataset is broken into two parts: training and test. Let’s say that the training dataset has images of four-legged chairs, but only those made of metal or wood. These are labeled “chair.” There are images of sofas, too, but without any discernible legs. These are labeled “not-chair.” The task of the ML algorithm is to classify an image as either a “chair” or a “not-chair.” Once the algorithm is trained, its performance is tested against the test dataset. As it happens, the test dataset also contains chairs made of plastic, not just of wood and metal. How might the algorithm perform?

可能发生的情况如下。如果正在训练的 ML 模型非常复杂,具有大量参数,那么它可能会了解到椅子除了有四条腿之外的其他信息。它可能会了解到椅子是由木头或金属制成的。这样的模型非常适合识别金属或木制椅子,但可能无法识别测试数据中的塑料椅子。可以说,该模型过度拟合了训练数据,捕捉到了不必要的、甚至可能对当前任务有害的椅子细微差别。一个参数较少的简单模型可能只会捕捉到椅子有四条腿的模式;然后,它会更好地推广到看不见的数据,并可能识别出由木头和金属以外的材料制成的椅子。

Here’s what might happen. If the ML model being trained is very complex, with a very large number of parameters, it might learn things about chairs that go beyond the fact that these chairs have four legs. It might learn that they are made of wood or metal. Such a model would be very good for recognizing metal or wooden chairs, but it might fail to identify the plastic chairs in the test data. The model can be said to have overfit the training data, picking up on nuances of chairs that were unnecessary and maybe even detrimental to the task at hand. A simpler model, with fewer parameters, might have picked up only on the pattern that chairs have four legs; it would then have generalized better to unseen data and possibly have recognized chairs made of materials besides just wood and metal.

我们可以用一个更简单的数据集来可视化这个问题。让我们在 xy 平面上取一些点。我们的任务是训练一个 ML 模型来执行回归,找到一条适合训练数据的曲线,以便模型能够很好地推广到看不见的数据。首先,这里有一些训练数据,然后是三个不同的模型或曲线,用于拟合数据:

We can visualize this issue with a much simpler dataset. Let’s take some points on the xy plane. Our task is to train an ML model to perform regression, to find a curve that fits the training data in a way that will allow the model to generalize well to unseen data. First, here’s some training data, followed by three different models, or curves, to fit the data:

我们能做的最简单的事情就是线性回归:通过数据拟合一条直线。这是一个简单的模型。几乎没有任何训练数据落在这条线上,因此该模型几乎会在训练数据的每个实例上产生或大或小的误差。

The easiest thing we can do is linear regression: Fit a straight line through the data. It’s a simple model. Almost none of the training data falls on the line, so the model will make errors, small and large, on nearly every instance of the training data.

如果我们通过添加更多参数和一些非线性来增加模型的复杂性,我们可能会找到一条更忠实地拟合训练数据的曲线。现在,模型在训练数据上出错的风险降低了:曲线实际上会经过一些数据点,但不是全部,而它错过的数据点将导致训练错误。

If we increase the complexity of our model, by adding more parameters and some nonlinearity, we might find a curve that fits the training data a little more faithfully. Now the model’s risk of making errors on the training data is reduced: The curve actually passes through some of the data points, though not all of them, and the ones it misses will contribute to the training error.

最后一张图展示了一个具有更多参数的复杂非线性模型:曲线穿过每个数据点。训练误差几乎为零。

The final panel shows a complex nonlinear model with considerably more parameters: The curve passes through each data point. The training error is almost zero.
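The three panels can be reproduced numerically (with our own toy data, not the book’s figure): fit polynomials of increasing capacity to noisy samples of a line and watch the training error fall toward zero as the model gains enough coefficients to pass through every point.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + 1 + rng.normal(0, 0.2, size=x.size)  # a line, plus noise

# Three models of increasing capacity: a line, a cubic, and a degree-9
# polynomial (10 coefficients for 10 points, so it can interpolate).
train_mse = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x, y, degree)
    train_mse[degree] = float(np.mean((model(x) - y) ** 2))
    print(degree, train_mse[degree])
```

The degree-9 curve drives the training error essentially to zero, exactly the third panel’s behavior.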

我们应该选择哪种模型?这不是一个容易回答的问题。选择取决于模型在您预留的测试数据上的表现。假设测试数据和训练数据来自相同的基础分布。(在椅子的例子中,可以说测试数据不属于分布,因为它们有塑料椅子的图像,而训练数据没有。但这些都是有争议的问题。如果算法的任务是对椅子进行分类,那么椅子的材质就不重要了——因此,塑料椅子应该被视为来自相同的基础分布。)

Which model should we choose? This is not an easy question to answer. The choice depends on the performance of the model on test data you have set aside. The test and training data are assumed to be drawn from the same underlying distribution. (In the example with the chairs, the test data can be said to be out of distribution, because they have images of plastic chairs, while the training data have none. But these are contentious issues. If the algorithm’s task is to classify chairs, it shouldn’t matter what they are made of—so, plastic chairs should be considered to be drawn from the same underlying distribution.)

假设穿过数据集的直线或多或少是正确的拟合,而训练数据散布在直线周围是因为数据中有噪声。在最简单的线性模型中,直线并没有拟合噪声;相反,它忽略了噪声。但由于测试数据来自同一分布,并且可能具有类似的噪声,因此最简单的线性模型在测试数据上的表现也会很差;测试数据出现错误的风险很高。

Let’s say that the straight line through the dataset is more or less the correct fit and that the fact that the training data are scattered about the straight line is because there’s noise in the data. In the case of the simplest linear model, the straight line hasn’t fit the noise; rather, it has ignored the noise. But because the test data are drawn from the same distribution and are presumably similarly noisy, the simplest linear model will do badly on the test data, too; there’s a high risk of error on the test data.

然而,最复杂的模型显然会跟踪训练数据中的每一个细微变化:它本质上过度拟合了数据,如果数据的变化是由于噪声引起的,那么模型就会学习到这种噪声的细节。然而,复杂模型对测试数据的预测也会严重错误,因为测试数据同样充满噪声。复杂模型将根据它所学习到的极其弯曲的回归曲线进行预测,该曲线特定于训练数据中的噪声,但由于噪声是随机的,曲线不会像跟踪训练数据那样跟踪测试数据的实例,从而导致严重的测试误差。

The most complex model, however, is clearly tracking every little variation in the training data: It has essentially overfit the data, and if the variations in the data are because of noise, then the model has learned the minutiae of this noise. The complex model will also, however, get its predictions on the test data badly wrong, because the test data are similarly noisy. The complex model will make predictions based on the extremely squiggly regression curve it has learned, which is specific to the noise in the training data, but because noise is random, the curve won’t track the instances of test data as well as it did the training data, leading to significant test error.
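The point about noise can also be made numerically. In this sketch (our own synthetic data), both models are fit to noisy samples of the same underlying line; the interpolating model wins on training error and pays for it at test time, because the squiggles it learned track the training noise, not the line:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def sample(n):
    """Noisy samples from the true line y = 2x + 1."""
    x = rng.uniform(0, 1, n)
    return x, 2 * x + 1 + rng.normal(0, 0.3, size=n)

x_train, y_train = sample(12)
x_test, y_test = sample(200)  # same distribution, fresh noise

simple = Polynomial.fit(x_train, y_train, 1)     # mostly ignores the noise
squiggly = Polynomial.fit(x_train, y_train, 11)  # interpolates the noise

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

print(mse(simple, x_train, y_train), mse(squiggly, x_train, y_train))
print(mse(simple, x_test, y_test), mse(squiggly, x_test, y_test))
```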

我们以回归为例,但这个问题也困扰着分类问题,分类涉及找到某种线性或非线性的边界来区分数据簇。如果模型过于简单,分类边界就不会很好地贴合数据的实际变化:你无法将训练误差和测试误差降到可接受的限度。如果模型过于复杂,分类边界就会跟踪数据中的每一个小偏差,从而过度拟合;它在训练数据上会表现得很好,但在测试过程中可能会产生巨大的分类错误。

We have used regression as an example, but the issue also dogs the problem of classification, which involves finding some linear or nonlinear boundary that separates clusters of data. If the model is too simple, it’ll find a boundary that doesn’t quite hew to the real variations in the data: You cannot get the training error and test error down to acceptable limits. If the model is too complex, the classification boundary will track every little deviation in the data, overfitting them, and will do really well on the training data but will likely make huge classification errors during testing.

金发姑娘原则

THE GOLDILOCKS PRINCIPLE

这些似乎只是吹毛求疵的机器学习从业者才会担心的理论问题,但让我们从第 6 章中举一个例子来看一下为什么这个问题可能真的事关生死。下图显示了对单个麻醉状态下被监测的人收集的真实脑电图数据进行主成分分析的结果。灰点代表人有意识时的两秒间隔;黑色三角形代表人无意识时的两秒间隔。这些是训练数据。

These might seem like theoretical concerns that would bother only nitpicking ML practitioners, but let’s take an example from chapter 6 to see why this issue might be literally a matter of life and death. Here’s the graphic showing the result of principal component analysis done on real EEG data collected from a single person being monitored while under anesthesia. The gray dots represent two-second intervals when the person is conscious; the black triangles are for when the person is unconscious. These are the training data.

想象一下构建一个简单的分类器,学习区分有意识状态和无意识状态。你可以找到一条直线(一个非常简单的模型),试图将两个簇分开。由于簇之间没有明显的空隙,我们找到的任何直线都会在训练过程中产生错误:黑色三角形簇中总会有一些灰点,反之亦然。为了减少这种训练错误,我们可以找到某条弯曲的曲线,仔细描绘两个簇之间边界的轮廓。例如,你可以使用取较小 k 值的 k 最近邻算法(第 5 章),得到一个高度非线性的边界。可以构建这样的模型,以最大限度地减少训练过程中的错误。

Imagine building a simple classifier that learns to separate the conscious from the unconscious states. You can find a straight line (a very simple model) that tries to separate the two clusters. Because there’s no clear space between the clusters, any straight line we find will make errors during training: There will always be a few gray dots in the cluster of black triangles, and vice versa. To reduce such training errors, we can find some squiggly curve that carefully maps the contours of the boundary between the two clusters. For example, you could use a k-nearest neighbor algorithm (chapter 5) with a small value of k. You’d get a highly nonlinear boundary. Such a model could be built to minimize the errors during training.
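A minimal nearest-neighbor classifier of the kind described, written from scratch on toy 2D data (not the EEG dataset): with k = 1, every training point is its own nearest neighbor, so the training error is zero and the decision boundary bends around every single point.

```python
import numpy as np

def knn_predict(train_x, train_y, queries, k=1):
    """Label each query by majority vote among its k nearest training points."""
    preds = []
    for q in queries:
        dists = np.linalg.norm(train_x - q, axis=1)  # Euclidean distance to all training points
        nearest = train_y[np.argsort(dists)[:k]]     # labels of the k closest points
        preds.append(int(np.bincount(nearest).argmax()))
    return np.array(preds)

# Toy 2D data: class 0 clustered near the origin, class 1 near (2, 2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

train_error = np.mean(knn_predict(X, y, X, k=1) != y)
print(train_error)  # → 0.0: a 1-NN model always fits its training data perfectly
```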

无论你选择哪种模型,它都必须根据新的脑电图信号预测意识状态,以帮助麻醉师确定麻醉剂量,以便患者保持或恢复意识。显然,ML 模型的推荐在这里很重要。如果做错了,后果会很严重。你会如何找到最好的模型?

Whichever model you choose, it has to predict the state of consciousness given some new EEG signal, to help the anesthesiologist determine the dosage of anesthesia, so that the person can be kept under or brought out of unconsciousness. Obviously, the recommendation of the ML model is important here. Getting it wrong has consequences. How would you go about finding the best possible model?

回想一下第 6 章,脑电图研究收集了 10 名患者的数据。研究人员使用 7 名患者的数据来训练他们的模型并构建分类器,并保留 3 名患者的数据来测试他们的模型。假设我们针对这三名患者的测试数据,分别测试一个极其简单的线性模型和一个高度复杂的非线性模型。简单的线性模型在训练过程中会犯更多错误,因为它欠拟合了训练数据,并且在测试数据上也会出错。复杂的非线性模型虽然在训练数据上几乎不犯任何错误,但在测试过程中会犯重大错误,正是因为它记住了仅来自 7 名独立参与者的训练数据。

Recall from chapter 6 that the EEG study had collected data from ten patients. The researchers used data from seven patients to train their model and build the classifier, and they kept aside data from three patients to test their model. Let’s say we test an extremely simple, linear model and also a highly complex, nonlinear model against the three-patient test data. The simple linear model will make more errors during training because it underfits the training data and makes errors on the test data as well. The complex nonlinear model, while it makes near-zero errors on the training data, will make significant errors during testing precisely because it memorized the training data, which were taken from only seven independent participants.

为什么测试误差很重要?测试数据是我们手头有的,但在训练期间没有提供给 ML 算法。训练后的模型在测试数据上的表现如何是我们唯一的表明其在野外的潜在性能,即其推广到真正未见过的数据的能力。我们希望测试误差尽可能低。

Why is the test error important? The test data are something we have in hand but have withheld from the ML algorithm during training. How well the trained model does on test data is our only indication of its potential performance out in the wild, meaning its ability to generalize to truly unseen data. We want the test error to be as low as possible.

既然如此,那么如何选择正确的模型复杂度呢?这引出了我们在这里要讨论的两种相互竞争的力量。一种叫做偏差:模型越简单,偏差越大。另一种叫做方差:模型越复杂,方差越大。

Given that, how does one choose the correct level of complexity of the model? This leads us to the two competing forces at work here. One is called bias: The simpler the model, the greater the bias. The other is called variance: The more complex the model, the greater the variance.

高偏差(即较简单的模型)会导致欠拟合、较高的训练误差风险和较高的测试误差风险,而高方差(即较复杂的模型)会导致过拟合、较低的训练误差风险和较高的测试误差风险。机器学习工程师的工作是找到最佳平衡点。如果将机器学习模型中的参数数量或可调旋钮的数量作为模型复杂性或容量的衡量标准,则标准机器学习理论认为模型应该具有恰当数量的参数:参数太少,模型就会过于简单(高偏差),无法捕捉到训练数据的必要细微差别。参数太多,模型就会变得非常复杂,并且会以非常细的粒度学习数据中的模式,以至于无法推广到未见过的数据。“这是在把数据拟合得太好与完全拟合不好之间的一种平衡。你需要处于中间,”加州大学圣地亚哥分校的机器学习专家米哈伊尔·贝尔金 (Mikhail Belkin) 说道。“这里有某种金发姑娘原则。不能太热,也不能太冷。”

High bias (i.e., simpler models) leads to underfitting, a higher risk of training error, and a higher risk of test error, whereas high variance (i.e., more complex models) leads to overfitting, a lower risk of training error, and a higher risk of test error. The job of an ML engineer is to find the sweet spot. If the number of parameters, or tunable knobs, that are there in an ML model is taken as a measure of the model’s complexity or capacity, then standard ML theory says that a model should have just the right number of parameters: Too few, and the model is too simple (high bias) and fails to capture the necessary nuances of the data on which it’s trained. Too many parameters, and the model becomes very complex and learns the patterns in the data with such fine granularity that it fails to generalize to unseen data. “It’s a balance between somehow fitting your data too well and not fitting it well at all. You want to be in the middle,” said Mikhail Belkin, the machine learning expert at the University of California at San Diego. “There is some sort of Goldilocks principle. Not too hot, not too cold.”

下面描绘的是机器学习中最著名的一组曲线。它描绘了偏差-方差权衡(请参阅 Scott Fortmann-Roe 关于该主题的精彩博客文章)。这里发生了许多事情。让我们从 x 轴开始:较低的值表示具有较少参数的低容量模型,而较高的值意味着具有大量参数的更复杂的模型。y 轴表示模型在训练或测试时出错的风险。

Depicted below is one of the most famous sets of curves you’ll see in machine learning. It depicts the bias-variance trade-off (see Scott Fortmann-Roe’s excellent blog post on the topic). A number of things are happening here. Let’s start with the x-axis: Lower values mean low-capacity models with fewer parameters, and higher values imply more complex models with a large number of parameters. On the y-axis is the risk that the model makes errors, either while training or testing.

简单介绍一下 x 轴上的标签“假设类容量”,以及它与模型复杂度和参数数量的关系。给定一组可以在训练期间调整值的参数,我们必须回到监督学习的基础知识,以思考这些参数使 ML 模型能够做什么。

A quick word about the label on the x-axis, “Capacity of Hypothesis Class,” and its relation to the complexity of a model and the number of parameters. Given a set of parameters whose values can be tuned during training, we must go back to the basics of supervised learning to reflect on what these parameters enable an ML model to do.

假设我们有(输入,输出)对形式的训练数据:

Let’s say we have training data in the form of (input, output) pairs:

[(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)]

[(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)]

利用训练数据,我们必须找到某个函数 f,在给定某个输入 x 的情况下预测 y。

Using the training data, we have to find some function, f, that predicts y, given some input, x.

y = f(x)

y = f(x)

模型中可调参数的数量决定了可以用该模型实现的假设函数集。本质上,我们只能从该集合中找到解决方案。假设我们想要一个线性模型。在二维情况下,线性模型是由其斜率和相对原点的偏移量定义的一条直线:因此,我们需要两个参数。但如果你的模型只有一个参数怎么办?那么你将被迫找到一个更简单的函数,其中斜率固定或偏移量固定,从而限制了你的选择。

The number of tunable parameters in a model determines the hypothetical set of functions that can be implemented using that model. In essence, we are limited to finding a solution from that set. Say we want a linear model. In the 2D case, a linear model is a line defined by its slope and its offset from the origin: So, we need two parameters. But what if your model had only one parameter? Then you’d be forced to find a simpler function in which either the slope was fixed or the offset was fixed, thus limiting your options.
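The shrinking hypothesis class can be sketched numerically (with our own toy data): the full two-parameter line can never fit the training data worse than a one-parameter line forced through the origin, because the restricted class is a subset of the full one.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])  # roughly y = 2x + 1

# Two-parameter hypothesis class: both slope and offset are free.
slope, offset = np.polyfit(x, y, 1)

# One-parameter class: offset pinned at 0, only the slope can vary.
slope_only = float(x @ y / (x @ x))  # least-squares slope through the origin

sse_two = float(np.sum((slope * x + offset - y) ** 2))
sse_one = float(np.sum((slope_only * x - y) ** 2))
print(sse_two, sse_one)  # the richer class never does worse on training data
```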

在非线性模型中,参数数量越多,我们通常可以使用的函数就越复杂。这应该让人想起第 9 章中的通用近似定理:只要有足够多的神经元,即使只有一个隐藏层的神经网络也可以近似任何函数,这意味着它的函数工具箱原则上是无限大的。因此,参数的数量(在神经网络的情况下,相当于权重的数量,或神经元之间的连接,其值是在训练过程中学习到的)是模型复杂性的代表,并决定了可以访问的函数集以找到好的函数。这也可以称为假设类:增加参数的数量,就可以增加假设类的容量。

And in a nonlinear model, the greater the number of parameters, the squigglier the functions we usually have at our disposal. This should bring to mind the universal approximation theorem from chapter 9: Given enough neurons, a neural network with even just a single hidden layer can approximate any function, which implies that its toolbox of functions is, in principle, infinitely large. So, the number of parameters (which, in the case of neural networks, is equivalent to the number of weights, or the connections between neurons, whose values are learned during training) is a proxy for the complexity of the model and determines the set of functions one can access in order to find a good one. This can also be called the hypothesis class: Increase the number of parameters, and you increase the capacity of the hypothesis class.

让我们回到对面的图。有一条虚线。对于简单模型,它从 y 轴的高点开始,随着模型复杂度的增加趋向于零。这条曲线表示训练风险,即模型在训练数据集上出错的风险。很明显,极其简单的模型在训练数据上表现不佳,因为它们对数据的拟合不足,而随着模型变得越来越复杂,它们开始过度拟合,因此训练风险趋向于零。

Let’s go back to the figure opposite. There is one dashed curve. It starts off high on the y-axis for simple models and goes toward zero as the model’s complexity increases. This curve represents the training risk, the risk that the model makes errors on the training dataset. It’s clear that extremely simple models do badly on the training data, because they are underfitting the data, and as the models get more complex, they start overfitting, hence the training risk goes to zero.

实线表示测试过程中出错的风险。对于高偏差、低复杂度的模型,它在 y 轴上从高处开始,下降到某个最小值,然后再次开始上升。碗底是我们希望机器学习模型所在的位置:它代表了欠拟合与过度拟合之间、模型简单性与复杂性之间的最佳平衡。这就是金发姑娘区。选择一个最小化测试错误风险的模型,可以最大化模型对更多未见数据(即模型在野外会遇到的数据,因为它既不属于训练数据也不属于测试数据)的泛化能力。因此,最小化测试误差意味着最小化泛化误差,或最大化泛化能力。

The solid curve represents the risk of error during testing. It starts off high on the y-axis for high-bias, low-complexity models, descends to some minimum, and then starts rising again. The bottom of the bowl is where we want our ML model to be: It represents the optimal balance between underfitting and overfitting, between model simplicity and complexity. This is the Goldilocks zone. Choosing a model that minimizes the risk of test error maximizes the model’s ability to generalize to further unseen data (data the model will encounter in the wild, so to say, as it’s not part of either the training or the test data). So, minimizing test error implies minimizing generalization error, or maximizing the ability to generalize.

从几乎所有传统机器学习的经验来看,这个故事似乎都是真的。随后深度神经网络登场,颠覆了这一传统观点。相对于训练数据的实例,深度网络的参数太多了:它们被称为过度参数化;它们应该过度拟合,无法很好地泛化到未见过的测试数据。然而它们却做到了。标准机器学习理论已无法充分解释深度神经网络为何如此有效。

From almost all empirical accounts of traditional machine learning, this story seemed to be true. Then deep neural networks entered the fray and turned this conventional wisdom upside down. Deep nets have way too many parameters relative to the instances of training data: They are said to be over-parameterized; they should overfit and should not generalize well to unseen test data. Yet they do. Standard ML theory can no longer adequately explain why deep neural networks work so well.

神经网络的难以忍受的陌生感

THE UNBEARABLE STRANGENESS OF NEURAL NETWORKS

2012 年 AlexNet 在机器学习舞台上宣告登场的几年后,芝加哥丰田技术学院的 Behnam Neyshabur、Ryota Tomioka 和 Nathan Srebro 对深度神经网络做出了一个有趣的观察。他们对只有一个隐藏层的网络进行了实验,结果发现,与预期相反,增加隐藏层中的神经元(或单元)数量(从而增加模型的容量)并不会导致网络过度拟合训练数据。这三人当时在两个标准图像数据集上测试他们的网络,其中一个是手写数字的 MNIST 数据集。首先,随着网络规模的增加,训练误差和测试误差都如预期一样下降。但随后,当网络规模继续增加、训练误差接近零时,按照偏差-方差权衡曲线,测试误差(或泛化误差)本应开始上升。但他们看到的并非如此。他们 2015 年论文中的措辞透露出他们的难以置信:

A few years after AlexNet announced itself on the machine learning stage in 2012, Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro of the Toyota Technological Institute at Chicago made an intriguing observation about deep neural networks. Experimenting with networks that had just one hidden layer, they found, contrary to expectation, that increasing the number of neurons, or units, in the hidden layer (and thus the capacity of the model) did not cause the network to overfit the training data. The trio was testing their networks on two standard image datasets, one of which was the MNIST dataset of handwritten digits. First, as they increased the size of the network, both training and test error decreased as expected. But then, as the network increased in size and the training error approached zero, the test error (or generalization error) should have started increasing, as per the bias-variance trade-off curve. That’s not what they saw. The language in their 2015 paper is suggestive of their disbelief:

更令人惊讶的是,如果我们将网络规模增加到超过实现零训练误差所需的规模,测试误差会继续下降!将学习视为由网络规模控制的假设类拟合,这种行为根本无法预测,甚至与之相反。例如,对于 MNIST,32 个单元足以实现零训练误差。当我们允许更多单元时,网络不会更好地拟合训练数据……但是,测试误差会下降。事实上,随着我们添加越来越多的参数,甚至超过训练示例的数量,泛化误差不会增加。
More surprising is that if we increase the size of the network past the size required to achieve zero training error, the test error continues decreasing! This behavior is not at all predicted by, and even contrary to, viewing learning as fitting a hypothesis class controlled by network size. For example for MNIST, 32 units are enough to attain zero training error. When we allow more units, the network is not fitting the training data any better…However, the test error goes down. In fact, as we add more and more parameters, even beyond the number of training examples, the generalization error does not go up.

公平地说,关于深度神经网络这种行为的暗示早于这项研究。但 Neyshabur 和同事是第一个系统地测试它的人。他们走得更远。如果你故意在数据集中引入噪音会怎么样?

To be fair, hints about such behavior of deep neural networks preceded this work. But Neyshabur and colleagues were the first to systematically test it. They went further. What if you deliberately introduced noise into your dataset?

以 MNIST 数据集中的图像为例。每幅图像都有一个相关标签:“五”表示数字 5,“六”表示数字 6,依此类推。从这些图像中抽取 1%,并随机打乱它们的标签。因此,数字 5 的一个实例可能会被错误标记为“四”,数字 9 的一个实例可能会被错误标记为“二”,依此类推。现在将数据集分为训练数据和测试数据,并训练你的神经网络,使其在训练数据上实现零训练误差。这是什么意思?因为我们有意在数据中引入了噪声,所以网络(既然它在训练数据上没有犯任何错误)正在迁就这些噪声。例如,它正在学习将那些被错误标记的数字 5 和 9 分别标记为“四”和“二”。它完美地拟合了数据。学习理论有一个生动的短语来描述这样的模型:据说它“粉碎”(shatter) 了训练数据。

Take the images in the MNIST dataset. Each image has an associated label: “five” for the digit 5, “six” for the digit 6, and so on. Take 1 percent of these images and randomly scramble their labels. So, one instance of digit 5 might get mislabeled as “four,” an instance of digit 9 might get mislabeled as “two,” and so on. Now divide the dataset into training and test data and train your neural network such that it achieves zero training error on the training data. What does this mean? Well, because we have intentionally introduced noise into the data, the network—as it’s making no errors on the training data—is accommodating the noise. For example, it’s learning to label those mislabeled digits 5 and 9 as “four” and “two,” respectively. It’s fitting the data perfectly. Learning theory has an evocative phrase to describe a model that does this: It’s said to “shatter” the training data.
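The label-scrambling step can be sketched as follows (the function name and details are our own illustration, not the paper’s exact procedure):

```python
import numpy as np

def corrupt_labels(labels, fraction, num_classes, rng):
    """Reassign a random `fraction` of labels to a different (wrong) class."""
    corrupted = labels.copy()
    n_corrupt = int(round(fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_corrupt, replace=False)
    for i in idx:
        wrong = [c for c in range(num_classes) if c != labels[i]]  # every class but the true one
        corrupted[i] = rng.choice(wrong)
    return corrupted

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)        # stand-in for MNIST labels 0-9
noisy = corrupt_labels(labels, 0.01, 10, rng)  # scramble 1 percent of them
print(np.sum(noisy != labels))  # → 10: exactly 1 percent of 1,000 labels now disagree
```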

而且由于它完美地拟合了嘈杂的训练数据,该模型在测试数据上的表现应该不会很好。(直观地说,该模型学习到的波浪曲线非常特定于它遇到的噪声,没有理由期望该模型具有泛化能力。)但事实并非如此。Neyshabur 及其同事写道,“即使有 5% 的随机标签,也不会出现明显的过度拟合,而且随着网络规模增加到超过实现零训练误差所需的规模,测试误差会持续减少。”他们困惑地问道:“这是怎么回事?”

And because it fits the noisy training data perfectly, this model should not do well on the test data. (Intuitively, the squiggly curve that the model has learned is very specific to the noise it has encountered, and there’s no reason to expect the model to generalize.) But that’s not what happened. Neyshabur and colleagues write, “Even with five percent random labels, there is no significant overfitting and test error continues decreasing as network size increases past the size required for achieving zero training error.” And they ask, perplexed, “What is happening here?”

他们的困惑是有道理的。按照标准思维方式,只有当随机梯度下降过程(用于训练网络)以某种方式最终削减了模型中可调旋钮的数量时,这种行为才有可能。有几种方法可以明确地做到这一点;这是一个称为正则化的过程,它本质上将一个复杂的模型变成了一个更简单的模型,使其能够更好地泛化。作者总结道,也许随机梯度下降正在进行一些隐式的正则化,从而降低了网络的容量并使模型更简单,以避免过度拟合。

Their perplexity was warranted. This behavior would be possible, in the standard way of thinking, only if the process of stochastic gradient descent (which was used to train the network) somehow ended up pruning the number of tunable knobs in the model. There are ways in which you can do this explicitly; it’s a process called regularization, which essentially turns a complex model into a simpler one, allowing it to generalize better. Maybe stochastic gradient descent was doing some implicit regularization, thus reducing the capacity of the network and making the model simpler in order to avoid overfitting, the authors concluded.

随后,在 2016 年,当时就职于麻省理工学院的张驰远 (Chiyuan Zhang) 与加州大学伯克利分校的 Ben Recht 以及谷歌的同事们,在用更大的数据集训练的更大的神经网络中展示了几乎相同的行为。在他们题为《理解深度学习需要重新思考泛化》的论文中,他们总结道:“我们进行的实验强调,几个成功的神经网络架构的有效容量足以粉碎训练数据。因此,这些模型原则上足够丰富,可以记住训练数据。这种情况对统计学习理论提出了概念上的挑战,因为传统的模型复杂性度量难以解释大型人工神经网络的泛化能力。”

Then, in 2016, Chiyuan Zhang, who was then at MIT, along with Ben Recht of the University of California, Berkeley, and colleagues at Google showed much the same behavior in larger neural networks trained on larger datasets. In their paper, provocatively titled “Understanding Deep Learning Requires Rethinking Generalization,” they concluded, “The experiments we conducted emphasize that the effective capacity of several successful neural network architectures is large enough to shatter the training data. Consequently, these models are in principle rich enough to memorize the training data. This situation poses a conceptual challenge to statistical learning theory as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks.”

这就是 2017 年的情况,当时加州大学伯克利分校西蒙斯计算理论研究所组织了为期三个月的机器学习理论基础项目。雷希特就他们关于重新思考泛化的论文发表了演讲。这引起了项目参与者的热烈讨论。深度神经网络带来的问题变得更加清晰。这些网络有能力插值数据(即完美拟合训练数据),同时还能对测试数据做出准确的预测。更成问题的是,这些网络可以插值噪声数据,而预测精度并没有像预期的那样下降。“我们经常教本科生,你不要……把数据拟合得太好,否则预测准确率会很低。这是一直被接受的广泛原则之一,而我们在这里做的恰恰相反,结果却没有问题,”加州大学伯克利分校教授、西蒙斯研究所机器学习研究主任彼得·巴特利特 (Peter Bartlett) 告诉我。“这是一件令人震惊的事情。”

This was the state of play in 2017, when the Simons Institute for the Theory of Computing at UC Berkeley organized its three-month program on the theoretical foundations of machine learning. Recht gave a talk on their paper on rethinking generalization. And this led to considerable discussion among the program participants. The problem being posed by deep neural networks had become clearer. These networks had the capacity to interpolate the data (meaning fit the training data perfectly) and yet make accurate predictions on test data. Even more problematic, these networks could interpolate noisy data, and still the prediction accuracy didn’t degrade as expected. “We routinely teach our undergraduates [that] you don’t want to…get too good a fit to the data [or] you’ll have poor predictive accuracy. That’s one of those broad principles that’s always been accepted, and here we are doing the opposite and it’s okay,” UC Berkeley professor Peter Bartlett, research director for Machine Learning at the Simons Institute, told me. “It’s a shocking thing.”

米哈伊尔·贝尔金 (Mikhail Belkin) 参加了该项目。“当时每个人都非常困惑,”他回忆道。另一位机器学习专家、卡内基梅隆大学的鲁斯兰·萨拉克胡特迪诺夫 (Ruslan Salakhutdinov) 讲授了深度学习教程。贝尔金回忆说,萨拉克胡特迪诺夫说:“从实际角度来看,解决这个问题的最佳方法是建立一个非常大的系统……基本上你要确保达到零训练误差。” 同样,这是一个违背标准学习理论的断言。贝尔金大吃一惊。“对我来说,这真是……大开眼界,”他告诉我。“我当时想,‘他到底在说什么?我们为什么要完全拟合数据?’”

Mikhail Belkin was attending the program. “Everybody was super confused at the time,” he recalled. Another ML expert, Carnegie Mellon University’s Ruslan Salakhutdinov, gave a tutorial on deep learning. Belkin recalled Salakhutdinov as saying, “The best way to solve the problem from a practical standpoint is you build a very big system…Basically you want to make sure you hit the zero training error.” Again, it was an assertion that went against standard learning theory. Belkin was flabbergasted. “That for me was…eye-opening,” he told me. “I was like, ‘What the hell is he talking about? Why should we fit the data exactly?’ ”

但 Belkin 很快意识到,他当时所在的俄亥俄州立大学实验室正在进行的核方法研究(我们在第 7 章中遇到的)已经暗示了类似的事情。“我们一直在用内核进行这个实验。我们观察到,你可以训练并得到零损失或很小的损失,它仍然有效,”现在在加州大学圣地亚哥分校的贝尔金说。

But Belkin soon realized that work going on in his lab at Ohio State University, where he was at the time, on kernel methods (which we encountered in chapter 7) was already hinting at something similar. “We had been doing this experiment with kernels. And we observed that you can train and get zero or a small loss and it [still] worked,” said Belkin, who is now at the University of California, San Diego.

事实证明,表明标准学习理论(以及偏差-方差权衡)并非在所有情况下都成立的线索,其实一直在慢慢积累。例如,加州大学伯克利分校的统计学家 Leo Breiman 于 1995 年写了一篇题为《Reflections After Refereeing Papers for NIPS》(NIPS 审稿后的反思)的论文。NIPS 代表“神经信息处理系统”,是该领域旗舰会议的名称。(如今,它被称为 NeurIPS。会议董事会成员在杰出女性机器学习专家的请愿下做出了这一更名,她们认为这个“首字母缩略词助长了性别歧视,是一种侮辱”,并列举了旧名称被用于性别歧视双关语的例子。)在他的论文中,Breiman 问道:“为什么参数繁多的神经网络不会过度拟合数据?”此外,1998 年,Bartlett 及其同事证明,一种名为 AdaBoost 的机器学习算法尽管模型很复杂,也不会过度拟合。

As it turns out, clues that standard learning theory (and the bias-variance trade-off) didn’t hold in all situations had been slowly accumulating. For example, Leo Breiman, a statistician at UC Berkeley, wrote a paper in 1995 called “Reflections After Refereeing Papers for NIPS.” NIPS stands for “neural information processing systems,” and was the name of the flagship conference in the field. (These days, it’s called NeurIPS. The conference’s board members made the change following a petition by prominent women ML experts, who made the case that the “acronym encourages sexism and is a slur” and provided examples of the old name’s being used in sexist puns.) In his paper, Breiman asked, “Why don’t heavily parameterized neural networks overfit the data?” Also, in 1998, Bartlett and colleagues had shown that an ML algorithm called AdaBoost also didn’t overfit, despite the model’s complexity.

在西蒙斯研究所为期三个月的逗留期间,贝尔金受到了这些辩论的鼓舞,开始系统地研究核方法和深度神经网络,主要是通过不断增加训练数据集中的噪声量来探究它们的性能。这些模型的复杂性(或容量)足以插值噪声数据。即使噪声影响了大约 5% 或更多的数据集,核机器和神经网络的性能也没有像预期的那样下降。“随着噪声水平的增加,没有什么会真正崩溃,”贝尔金说。

Energized by the debates during the three-month sojourn at the Simons Institute, Belkin embarked on a systematic study of kernel methods and deep neural networks, essentially probing their performance with increasing amounts of noise in the training dataset. The complexity, or capacity, of the models was enough to interpolate the noisy data. Even when the noise affected about 5 percent of the dataset or more, the performance of both kernel machines and neural networks didn’t degrade as expected. “As you increase the noise level, nothing really breaks,” Belkin said.

与此同时,巴特利特和他的同事开始探索这种现象,并给它起了一个诱人的名字:“良性过度拟合”。但其他人称之为无害插值。

Meanwhile, Bartlett and colleagues began exploring this phenomenon and gave it a beguiling name: “benign overfitting.” Yet others called it harmless interpolation.

对于 Belkin 来说,AdaBoost、核机器和神经网络的相似行为暗示着某种深刻的东西。他想,也许研究人员并没有完全理解机器学习本身的惊人特性,而这一认识之所以变得明显,正是因为深度神经网络及其看似违反规则的能力。“我们通过选择性地对不符合模式的东西视而不见,说服自己 [ML] 没问题,”贝尔金告诉我。“我的感觉是,理论并不好。”

For Belkin, the similar behavior of AdaBoost, kernel machines, and neural networks was suggesting something profound. Maybe researchers don’t fully understand the amazing properties of machine learning itself, he thought, a realization that became apparent only because of deep neural networks and their seemingly lawbreaking abilities. “We [had] convinced ourselves that [ML] was fine by selectively closing our eyes on things that didn’t fit the mold,” Belkin told me. “My feeling was that theory was not fine.”

使用随机梯度下降训练的深度神经网络正将机器学习研究人员引向未知领域。Belkin 将此称为机器学习的“terra incognita”(未知之地)。但在我们理解神经网络为何能将我们带到这一步之前,我们需要先简要介绍构建和训练神经网络的各种方法。

Deep neural networks, trained using stochastic gradient descent, are pointing ML researchers toward uncharted territory. Belkin calls this machine learning’s “terra incognita.” But before we can appreciate why neural networks have brought us to this point, we need a brief segue to appreciate the varieties of ways of building and training them.

参数和超参数

OF PARAMETERS AND HYPERPARAMETERS

由于人工智能研究发生了根本性转变,人们得以发现和探索机器学习中的未知领域,这种转变涉及对深度神经网络和其他架构进行实验。为了进行此类实验,研究人员必须摆弄参数和超参数。我们知道,参数是模型中的旋钮(例如神经网络的权重),它们在训练过程中进行调整。超参数是工程师在训练开始前设置的旋钮。后者包括神经网络的架构(它决定了层数、每层的神经元数量以及这些层的互连方式等)、训练数据的大小、优化算法的精确类型以及是否进行显式正则化(例如修剪参数数量)。找到超参数的良好值或最优值是一门手艺,几乎是一门艺术。

The discovery and exploration of uncharted territories in machine learning has been enabled because of a fundamental shift in AI research, one that involves performing experiments on deep neural networks and other architectures. To do such experiments, researchers have to fiddle with parameters and hyperparameters. Parameters, we know, are those knobs in a model—the weights of a neural network, for example—that get tuned during training. Hyperparameters are knobs that are set by engineers before training begins. These latter include, for example, the architecture of the neural network (which dictates, among other things, the number of layers, the number of neurons per layer, and the way these layers are interconnected), the size of the training data, the precise type of optimization algorithm, and whether one does explicit regularization (such as pruning the number of parameters). Finding good or optimal values for hyperparameters is a craft, almost an art.

在本书中,我们研究了几种神经网络架构:单层感知器、霍普菲尔德网络(第 8 章)、多层感知器(第 10 章)和卷积神经网络(第 11 章)。但在过去十年左右的时间里,深度神经网络架构如雨后春笋般涌现,创造了一个名副其实的“动物园”。不过,我们仍然可以从总体上对它们进行大致分类。

In this book, we have looked at a handful of architectures for neural networks: the single-layer perceptron, Hopfield networks (chapter 8), the multi-layer perceptron (chapter 10), and the convolutional neural network (chapter 11). But in the past decade or so, deep neural network architectures have mushroomed, creating a veritable zoo of these creatures. Still, one can take an overarching view to classify them broadly.

首先,一般来说,网络可以是前馈的,也可以是循环的。前馈神经网络是信息单向流动的网络,从输入层流向输出层。因此,如果一个神经元产生输出,该输出只会作为后面各层神经元的输入。输出不能作为输入返回到同一层或之前各层的神经元。相比之下,循环神经网络允许反馈连接,因此神经元的输出不仅会影响后面各层的神经元,还可以作为同一层或之前各层神经元的输入。这使得循环神经网络能够“记住”以前的输入,使其适用于涉及随时间变化的输入的问题。(一个很好的例子是名为长短期记忆(LSTM)的循环神经网络架构,由我们在前几章中介绍过的 Jürgen Schmidhuber 和他的同事 Sepp Hochreiter 于 1997 年提出。)

First, networks can be, generally speaking, either feedforward or recurrent. Feedforward neural networks are those in which the information flows one way, from the input layer to the output. So, if a neuron is producing an output, the output serves as input only to neurons that are part of the layers ahead. The output cannot return as input to neurons in the same layer or in preceding layers. A recurrent neural network, by contrast, allows for feedback connections so that the outputs of neurons not only influence neurons in the layers ahead, but can also serve as inputs to neurons in the same layer or in the layers that came before. This allows recurrent neural networks to “remember” previous inputs, making them useful for problems that involve inputs that vary over time. (An excellent example of this is a recurrent neural network architecture called long short-term memory, or LSTM, proposed in 1997 by Jürgen Schmidhuber, whom we met in previous chapters, and his colleague Sepp Hochreiter.)

反向传播算法是训练神经网络(尤其是前馈网络)的主力。该算法也可用于训练循环网络,但我们不会在这里详细介绍。无论网络类型如何,我们都需要从概念上理解这一点。给定一个输入,神经网络将产生一个输出。我们可以定义一个函数,通过以某种预定义的方式将产生的输出与预期输出进行比较来计算网络的损失或误差。该函数计算单个训练数据实例的损失或者计算所有训练数据实例的平均损失,称为损失函数或成本函数。训练网络意味着最小化训练数据的损失。

The backpropagation algorithm is the workhorse for training neural networks, particularly feedforward networks. The algorithm can also be used to train recurrent networks, but we won’t get into the specifics of that here. Regardless of the type of network, here’s the thing we need to appreciate conceptually. Given an input, a neural network will produce an output. We can define a function that calculates the loss, or error, made by the network by comparing the produced output in some predefined manner to the expected output. The function calculates the loss for a single instance of training data or it calculates the average loss over all instances of training data and is called a loss, or cost, function. Training a network means minimizing the loss over training data.
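As a concrete sketch of the loss and cost functions just described, here is a minimal version that assumes squared error as the "predefined manner" of comparing produced and expected outputs; the text leaves the exact comparison unspecified, so the squared-error choice is an illustrative assumption:

```python
# A minimal sketch: squared error as the per-instance loss, and the
# cost as the average loss over all training instances. The choice of
# squared error is an assumption for illustration only.

def loss(produced, expected):
    """Loss for a single training instance."""
    return (produced - expected) ** 2

def cost(produced_outputs, expected_outputs):
    """Average loss over all training instances."""
    pairs = list(zip(produced_outputs, expected_outputs))
    return sum(loss(p, e) for p, e in pairs) / len(pairs)

# A network that always outputs 0.5, scored against targets 0 and 1:
print(cost([0.5, 0.5], [0.0, 1.0]))  # 0.25
```

Training then amounts to adjusting the network's parameters so that `cost` over the training data shrinks.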

我们已经看到,训练模型以实现零训练成本可能会导致过拟合。为了防止这种情况,通常会修改成本函数,向其中添加另一项,称为正则化项。可以将其视为一个迫使函数考虑模型复杂性(或容量)的项;如果把模型做得过于复杂,我们就会受到惩罚。在神经网络中,显式正则化有助于防止过拟合。例如,正则化可以防止网络权重(即参数)的值变得太大,其假设是权重越大意味着模型越复杂,反之亦然。

We have already seen that training a model to achieve zero training cost can result in overfitting. To prevent this, the cost function is often modified with the addition of another term to it, called the regularizer. Think of this as a term that forces the function to take into account the model’s complexity, or capacity; we incur a penalty for making the model overly complex. In a neural network, explicit regularization helps prevent overfitting. Regularization could, for example, prevent the values for the weights, or parameters of the network, from getting too large, the assumption being that large weights mean more complex models, and vice versa.
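A minimal sketch of the idea above, assuming an L2 (squared-weight) penalty, which is one common form of regularizer; `lam`, the penalty strength, is a hypothetical knob and not anything named in the text:

```python
# Sketch of explicit regularization: the cost is the data loss plus a
# penalty that grows with the squared size of the weights, discouraging
# large (and, by assumption, more complex) models.

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def regularized_cost(preds, targets, weights, lam=0.01):
    penalty = lam * sum(w ** 2 for w in weights)
    return mse(preds, targets) + penalty

# Two weight vectors achieving identical (zero) data loss: the penalty
# alone makes the large-weight model costlier.
preds, targets = [1.0, 0.0], [1.0, 0.0]
small_w, large_w = [0.1, -0.2], [3.0, -4.0]
assert regularized_cost(preds, targets, small_w) < regularized_cost(preds, targets, large_w)
```

With `lam = 0`, the penalty disappears and the cost reduces to the plain data loss.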

还有更多有趣的方法可以防止过度拟合。例如,您可以设置网络在训练期间随机删除一些连接(从而减少有效参数的数量)。我们还可以为神经元选择激活函数。在第 9 章和第 10 章中,我们遇到了 S 型激活函数。还有其他的。不同的激活函数会导致神经元及其构成的网络表现不同;最重要的是,这些函数必须是可微的,才能使反向传播起作用。(如前所述,有些激活函数在其整个域上不可微,但只要小心,仍然可以使用它们。例如,ReLU 函数在x = 0处不可微。其在x = 0 处的导数可取为 0、1 或 0.5。使用 ReLU 的其他好处超过了这个小小的不便。)

There are also more interesting methods for preventing overfitting. For example, you can set up your network to randomly drop some connections during training (thereby reducing the number of effective parameters). We can also choose the activation function for our neurons. In chapters 9 and 10, we encountered the sigmoid activation function. There are others. Different activation functions lead neurons and the networks they constitute to behave differently; most important, these functions must be differentiable in order for backpropagation to work. (As pointed out earlier, there are activation functions that are not differentiable over their entire domain, but they can still be used, with some care. For example, the ReLU function is not differentiable at x = 0. Its derivative at x = 0 can be taken to be 0, 1, or 0.5. The other benefits of using ReLU outweigh this minor inconvenience.)
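The parenthetical about ReLU can be made concrete. A sketch of the function and of a derivative that simply adopts a convention at the non-differentiable point x = 0:

```python
# ReLU and its derivative. ReLU is not differentiable at x = 0, so the
# gradient there is chosen by convention (0, 1, or 0.5), as the text notes.

def relu(x):
    """max(0, x): zero for negative inputs, identity for positive ones."""
    return x if x > 0 else 0.0

def relu_grad(x, at_zero=0.0):
    """Derivative of ReLU, with an explicit convention at x = 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero

print(relu(-2.0), relu(3.0))        # 0.0 3.0
print(relu_grad(0.0, at_zero=0.5))  # 0.5
```

In practice, inputs land exactly on zero so rarely that the choice of convention has little effect on training.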

ML 工程师除了必须在这些不同的超参数中做出选择之外,还必须更宏观地选择使用监督学习还是无监督学习。我们主要关注的是监督学习,它要求训练数据带有标签,这意味着对于每个输入,都有一个相应的预期输出。正是这一点使我们能够计算每个训练数据实例的损失。我们还简要介绍了无监督学习,例如,必须先告知算法训练数据集中有多少个聚类,然后算法才能找到这些聚类并将每个数据实例分配给其中一个。然而,过去五年来最重要的发展之一(正是它引发了人们对 ChatGPT 等人工智能的巨大兴趣)是所谓的自监督学习,这是一种巧妙的方法:它采用未标记的数据,在没有人工参与的情况下创建隐式标签,然后对自身进行监督。

An ML engineer, besides having to choose among these various hyperparameters, also has to choose, even more broadly, whether to use supervised learning or unsupervised learning. We have mainly been focused on supervised learning, which requires training data to be labeled—which means that for each input, there’s a corresponding expected output. This is what allows us to calculate the loss per instance of training data. We also briefly encountered unsupervised learning, in which an algorithm, for example, must be told how many clusters there are in the training dataset, and then it can find those clusters and assign each instance of the data to one of the clusters. However, one of the most significant developments over the past five years—one that has led to the enormous explosion of interest in AIs such as ChatGPT—is something called self-supervised learning, a clever method that takes unlabeled data and creates implicit labels without human involvement and then supervises itself.
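The unsupervised setup mentioned above, where the algorithm is told the number of clusters up front, can be sketched with a bare-bones 1-D k-means-style pass; this is an illustrative toy, not any particular library's API:

```python
# Told k = 2 clusters in advance, alternate between assigning each point
# to its nearest centroid and moving each centroid to the mean of its
# assigned points.

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            for p in points]

def update(points, labels, k):
    """Move each centroid to the mean of its assigned points."""
    cents = []
    for i in range(k):
        members = [p for p, l in zip(points, labels) if l == i]
        cents.append(sum(members) / len(members) if members else 0.0)
    return cents

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]  # two obvious groups
centroids = [0.0, 5.0]                      # k = 2, supplied up front
for _ in range(5):                          # a few alternating passes
    labels = assign(points, centroids)
    centroids = update(points, labels, 2)

print(labels)  # [0, 0, 0, 1, 1, 1]
```

No labels were supplied; only the number of clusters was.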

伯克利的赌注

A BET IN BERKELEY

2014 年,加州大学伯克利分校的一组研究人员(其中包括计算机视觉领域的杰出专家 Jitendra Malik)开发了一种深度神经网络解决方案,该方案在一项名为“面向视觉物体类别的模式分析、统计建模和计算学习”(PASCAL VOC)的计算机视觉任务上表现出色。该任务要求在给定一小组图像数据集的情况下,学习如何围绕这些图像中不同类别的物体(例如自行车、汽车、马、人和绵羊)绘制方框或对其进行分割,然后为它们命名。

In 2014, a group of researchers at the University of California, Berkeley, among them Jitendra Malik, a formidable expert in computer vision, developed a deep neural network solution that performed admirably on a computer vision task called pattern analysis, statistical modeling, and computational learning (PASCAL) for visual object classes (VOC). The task entailed learning, given a small dataset of images, how to draw boxes around, or to segment, different categories of objects in those images, such as bicycles, cars, a horse, a person, and sheep, and then to name them.

为了解决这个问题,马利克和同事首先使用监督学习,在大得多的 ImageNet 数据集(即 AlexNet 在 2012 年征服的同一数据集)上训练 CNN。这只是学习如何使用人类生成的标签对图像进行分类。然后,该团队利用这个“预训练”的网络,在 PASCAL VOC 数据集上对其进一步微调。这些图像带有由人类标注的“边界框”,勾勒出图像中各个类别的物体。经过微调的网络(称为 R-CNN)在检测测试数据中的物体边界并对其进行相应分类方面,能够超越现有方法。

To solve the problem, Malik and colleagues first used supervised learning to train a CNN on the much larger ImageNet dataset (the same dataset that AlexNet conquered in 2012). This was simply about learning how to classify images using labels generated by humans. The team then took this “pre-trained” network and further fine-tuned it on the PASCAL VOC dataset. These images had “bounding boxes,” identified by humans, that delineated the various categories of objects in those images. The fine-tuned network, called R-CNN, was then able to outperform existing methods at detecting the boundaries of objects in test data and classifying them accordingly.

对于同样是加州大学伯克利分校计算机视觉专家、马利克昔日学生的阿列克谢·埃弗罗斯(Alexei Efros)来说,R-CNN 方法令人不安。一个首先在 ImageNet 数据上训练的网络,其标签仅指代图像中的某个物体(例如一只猫或一辆车)而不涉及其实际形状或边界,为什么在用带有人工标注的感兴趣物体边界框的数据集微调之后,就能在检测物体边界方面表现出色?同一个网络,如果没有 ImageNet 预训练,仅在 PASCAL VOC 数据集上训练时表现不佳。埃弗罗斯推断,也许 CNN 只是渴求 ImageNet 数据集中包含的一般信息,而人类提供的注释(将图像标记为汽车、狗、猫等)价值不大。

For Alexei Efros, also a computer vision expert at UC Berkeley and Malik’s former student, the R-CNN approach was troubling. Why should a network that has first been trained on ImageNet data, with labels that referred only to some object in the image (say, a cat or a car) without regard to its actual shape or boundaries, do well on detecting the boundaries of objects, albeit after it had been fine-tuned using a dataset that had human-labeled boxes around the objects of interest? The same network, sans the ImageNet pre-training, did poorly when trained only on the PASCAL VOC dataset. Maybe, Efros reasoned, the CNN was simply hungry for the general information contained in the ImageNet dataset, and the human-supplied annotations, which labeled the images as those of cars, dogs, cats, and so on, were of little value.

因此,2014 年 9 月 23 日,在伯克利大学校园北边外的一家咖啡馆里,埃弗罗斯与马利克打赌:一年之内就会出现一种机器学习算法,可以在不使用人类提供的标签(例如 ImageNet 中提供的标签)的情况下执行物体检测。这场打赌被正式记录下来:“如果到 2015 年秋季第一天(9 月 23 日),存在一种方法,能够在不使用任何额外的人工注释(例如 ImageNet)作为预训练的情况下,匹敌甚至超越 R-CNN 在 Pascal VOC 检测上的表现,Malik 先生承诺为 Efros 先生买一(1)份意式冰淇淋(2 勺:一勺巧克力味,一勺香草味)。”三名学生见证了这场赌注。“我也在 Facebook 上发布了这件事,并告诉人们,如果有人……愿意帮助我赢得赌注,他们就可以得到我一半的奖金,”Efros 告诉我。“我拿巧克力味,他们拿香草味。”Efros 输了这场赌注(就物体检测而言,R-CNN 在此后一段时间内仍然是最好的方法),但他认为这促使他和其他人开发出一种无需人工标注数据来训练神经网络的新方法:自监督学习。

So, on September 23, 2014, at a café in Berkeley, just outside the northern edge of the university campus, Efros bet Malik that within a year, there would be an ML algorithm that would perform object detection without using labels supplied by humans, such as those provided in ImageNet. The bet was formalized: “If, by the first day of autumn (Sept. 23) of 2015, a method will exist that can match or beat the performance of R-CNN on Pascal VOC detection, without the use of any extra, human annotations (e.g. ImageNet) as pre-training, Mr. Malik promises to buy Mr. Efros one (1) gelato (2 scoops: one chocolate, one vanilla).” Three students witnessed the bet. “And I posted this thing on Facebook, too, and told people that if somebody…will help me win the bet, they can get half of my winnings,” Efros told me. “I keep the chocolate, they get the vanilla.” Efros lost the bet—for object detection, R-CNN remained the best game in town for a while—but he thinks it spurred him and others toward developing a new approach to training neural networks without using human-annotated data: self-supervised learning.

回想起来,自监督学习似乎异常简单。以大型语言模型(LLM)为例,比如 GPT-3(ChatGPT 的前身)。它们在从互联网上抓取的大量文本语料库上进行训练。训练算法取一个短句,例如屏蔽其中一个单词,然后将这个带有被屏蔽单词的句子作为网络的输入(细节要稍微复杂一些,但我们姑且以被屏蔽的单词作为信息单位)。网络的任务:预测缺失的单词并补全句子。假设句子是“我要走回___”,被屏蔽的单词是“家”。一开始,网络很可能会猜错。可以定义一个损失函数来计算网络错得有多离谱。反向传播算法和随机梯度下降一起使用,首先将错误的部分责任分配给网络中的每个参数,然后更新参数值以减少损失;如果给网络同样的被屏蔽句子并要求其再次预测,它会做得好一些。训练算法会迭代训练文本语料库中的每个句子:屏蔽一个单词,要求网络预测被屏蔽的单词,计算损失,然后更新网络的参数以稍微减少损失。(一个额外的复杂性:如果训练数据中有这样的句子,我们示例中被屏蔽句子的空白处也可以合法地用单词“alone”(独自)填充。因此,LLM 的预测本质上是概率性的:例如,如果在训练期间遇到“home”的次数多于“alone”,那么“home”被赋予的概率就会大于“alone”。)

Self-supervised learning, in retrospect, seems extraordinarily simple. Take large language models (LLMs) such as GPT-3 (the precursor to ChatGPT). They are trained on an enormous corpus of text slurped off the internet. The training algorithm takes a small sentence, masks one word, for example, and gives that sentence with the masked word as an input to the network (the details are a little more complicated, but let’s go with masked words as units of information). The network’s task: to predict the missing word and complete the sentence. Let’s say the sentence is “I’m going to walk back ___.” The masked word is “home.” In the beginning, the network will most likely guess wrong. One can define a loss function that calculates the extent to which the network gets it wrong. The backpropagation algorithm and stochastic gradient descent, used together, first assign partial blame for the error to each of the parameters in the network and then update the parameter values such that the loss is reduced; if the network is given the same masked sentence and asked to predict again, it does a bit better. The training algorithm iterates over every sentence in the corpus of the training text, masking a word, asking the network to predict the masked word, calculating the loss, and then updating the network’s parameters to reduce the loss a little. (One added complexity: The blank space in our example masked sentence could legitimately have been completed with the word “alone,” if such a sentence was in the training data. So, the LLM’s predictions are going to be inherently probabilistic, with, say, “home” being assigned a greater probability than “alone,” if it encounters “home” more often than “alone” during training.)
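The parenthetical point about probabilistic predictions can be illustrated without a neural network at all, by counting completions in a toy corpus; the counts below are invented for illustration:

```python
from collections import Counter

# If "home" follows "I'm going to walk back" more often than "alone" in
# the training text, a model fit to that text should assign "home" the
# higher probability. Invented toy counts stand in for the corpus.

completions_seen = ["home", "home", "home", "alone"]

counts = Counter(completions_seen)
total = sum(counts.values())
probs = {word: n / total for word, n in counts.items()}

print(probs)  # {'home': 0.75, 'alone': 0.25}
```

An LLM does vastly more (it conditions on context via learned parameters), but its outputs are, likewise, a probability distribution over next words rather than a single certain answer.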

虽然每次迭代都很琐碎,但对从互联网上获取的数十亿页文本进行迭代是一项艰巨的任务:它可能需要数月的计算和数千兆瓦时的能源。但最终,以这种方式训练的 LLM 在其参数值中包含了它所训练的人类书面语言的统计结构和知识。现在,给定一些输入文本,它可以生成下一个最可能的单词,将其附加到原始文本中,再生成下一个单词,并继续下去,产生看似连贯的输出,模仿人类产生语言的方式。输出文本甚至可能让人觉得具有推理能力,不过在撰写本文时,研究人员仍在争论:LLM 究竟是在进行任何真正的推理,还是只是在反刍符合其训练数据中统计模式和规律的文本,甚至这两种说法之间是否存在任何有意义的区别。

While each iteration is trivial, iterating for billions of pages of text taken from the internet is a gargantuan task: It can take months of computing and gigawatt-hours of energy. Still, at the end, an LLM trained in this manner contains within it, in the values of its parameters, the statistical structure of and knowledge contained in the written human language on which it’s been trained. Now, given some input text, it can generate the next most-probable word, append it to the original text, generate the next word, and keep going, producing seemingly coherent outputs that mimic the way humans produce language. The output text can even be suggestive of the ability to reason—though, at the time of this writing, researchers are at odds over whether LLMs are actually doing any reasoning or are simply regurgitating text that satisfies the statistical patterns and regularities they encounter in the training data, or even whether there’s any meaningful difference between these two ideas.

然而,埃弗罗斯对在图像上做类似的事情更感兴趣。到 2016 年,他的团队已经展示了如何将自监督学习用于图像。该算法采用未注释的图像并屏蔽一些像素。它将这张屏蔽图像输入到神经网络,并要求网络生成完整的未屏蔽图像。当然,网络最初会出错。该算法使用其损失函数来计算损失,为网络的每个参数分配适当的责任,然后更新这些参数。随着损失的减少,给定相同的屏蔽图像,网络将比以前做得更好。该算法对训练数据集中的所有图像重复此操作。与 LLM 学习语言的统计结构的方式非常相似,自监督图像处理网络学习图像的统计结构。

Efros, however, was more interested in doing something similar with images. By 2016, his team had shown how to use self-supervised learning for images. The algorithm takes an unannotated image and masks some pixels. It feeds this masked image to a neural network and asks the network to generate the unmasked image in all its fullness. Of course, the network will get it wrong initially. The algorithm uses its loss function to calculate the loss, assign appropriate blame to each of the network’s parameters, and then update those parameters. With the loss thus reduced, the network, given the same masked image, will do better than before. The algorithm repeats this for all the images in the training dataset. And in much the same way that an LLM learns the statistical structure of language, a self-supervised image-processing network learns the statistical structure of images.

尽管如此,将自监督学习用于视觉的努力从未达到 LLM 所取得的成功水平。这种情况在 2021 年 12 月发生了改变,当时 Kaiming He 和 Meta 的同事在 Efros 团队工作的基础上,发布了他们的“掩码自编码器”(MAE)。他们的算法会随机遮盖图像,遮挡每张图像近四分之三的内容。MAE 有一个编码器,可以将图像中未被遮盖的部分转换为图像各方面的所谓潜在表示。然后,解码器将这些表示转换回完整图像。在训练期间,给定一张被遮盖的图像,MAE 会尝试生成未遮盖的图像,网络由此学习到其训练数据集中图像重要特征的潜在表示。

Still, efforts at using self-supervised learning for vision never quite achieved the level of success seen with LLMs. This changed in December 2021, when Kaiming He and colleagues at Meta, building on the work done by Efros’s team, revealed their “masked auto-encoder” (MAE). Their algorithm randomly masks images, obscuring almost three-quarters of each image. The MAE has an encoder that turns the unmasked portions of the image into so-called latent representations of aspects of the image. Then a decoder converts those representations back into full images. During training, given a masked image, the MAE tries to generate the unmasked image, and the network learns latent representations of important features of the images in its training dataset.
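A sketch of the masking step described above, assuming the image has already been split into patches; the 14×14 grid and the exact 75 percent ratio are illustrative assumptions in the spirit of the text's "almost three-quarters":

```python
import random

# MAE-style masking sketch: hide ~75% of an image's patches at random.
# Only the visible patches would be fed to the encoder; the decoder is
# then asked to reconstruct the full image.

def mask_patches(num_patches, mask_ratio=0.75, seed=0):
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, masked

visible, masked = mask_patches(196)  # e.g. a 14x14 grid of patches
print(len(masked), len(visible))     # 147 49
```

The "label" here is free: it is simply the original, unmasked image, so no human annotation is needed.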

当训练好的 MAE 看到一张之前从未见过、近 80% 被遮挡的公交车图像时,它仍然可以重建出公交车。可以说,它已经内化了公交车的结构,而人类并没有明确标记任何图像。当以这种方式训练的 MAE 在物体检测和分割任务上进行微调时,它在各个方面都胜过 R-CNN。虽然等到了 2021 年,但事实证明 Efros 是对的。“我当初打赌时应该说十年而不是一年,”他告诉我。“那是我的错误。”

When a trained MAE was shown a previously unseen image of a bus, almost 80 percent of which was obscured, the MAE could still reconstruct the bus. It had, so to say, internalized the structure of the bus—but without humans having explicitly labeled any images. And when an MAE trained in this manner was fine-tuned on an object detection and segmentation task, it outperformed R-CNN in all manner of ways. It took until 2021, but Efros turned out to be right. “In my bet, I should have said ten years instead of one year,” he told me. “That was my mistake.”

无论如何,向自我监督学习的转变产生了巨大的影响,因为它将机器学习从极其昂贵的人工注释数据的束缚中解放出来。埃弗罗斯喜欢说:“革命不会受到监督。”

Regardless, the move toward self-supervised learning has enormous consequences, for it has freed machine learning from the shackles of superexpensive, human-annotated data. Efros likes to say, “The revolution will not be supervised.”

未知领域

IN UNCHARTED WATERS

由于不再受为监督学习标注数据的限制,深度神经网络正变得越来越大。截至撰写本文时,密集连接的 LLM(这里的“密集”是指某一层中神经元的输出会成为下一层每个神经元的输入)已拥有超过五千亿个参数,而且有望出现更大的网络。随着这些网络变得越来越大,它们的行为继续挑战我们对机器学习的传统理解,尤其是偏差-方差权衡曲线的图景。

Without the constraint of having to annotate data for supervised learning, deep neural networks are getting bigger and bigger. As of this writing, densely connected LLMs—“dense” here refers to the fact that the output of a neuron in one layer becomes the input to every neuron in the next layer—have more than half a trillion parameters, with the promise of even bigger networks on the horizon. And as these networks become bigger, their behavior continues to challenge our traditional understanding of machine learning, particularly the landscape of the bias-variance trade-off curve.

深度网络所揭示的这一图景中最有趣的地方之一,位于原始偏差-方差曲线的右侧。我们之前看到,在标准的偏差-方差权衡中,随着模型容量的增加,测试误差(或泛化误差)从高处开始,降至最低,然后再次上升到最大值。在测试误差达到最大值时,训练误差降至零。模型实现了插值:它过拟合了训练数据。传统的 ML 研究(除了 AdaBoost 那些孤立事件)不再去关心插值点之外的情况。

One of the most intriguing places in that landscape that deep nets have uncovered lies to the right of the original bias-variance curve. We saw earlier that in the standard bias-variance trade-off, as one increases the capacity of the model, the test error, or generalization error, starts off high, drops to a minimum, and then begins rising again to a maximum. At the point where the test error hits its maximum, the training error touches zero. The model has interpolated: It has overfit the training data. Traditional ML research (except for those isolated incidents with AdaBoost) stopped fussing about what lay beyond.

Belkin 及其同事是最早系统地探索该区域的人之一。2018 年,他们表明,一旦将核机器和深度神经网络的容量提高到插值点以上,两者的性能就会开始改善,走向更低的测试风险和更好的表现。早在 1990 年代,人们就已在一些线性模型中凭经验观察到这种行为。Belkin 及其同事将这种现象称为“双重下降”,并提出这是一个统一的原则:第一次下降使测试误差达到最小值,随后上升,然后再次下降到较低的测试误差水平。

Belkin and colleagues were among the earliest to systematically explore that region. In 2018, they showed that the performance of both kernel machines and deep neural networks begins improving once you increase the capacity of both beyond the point of interpolation, toward lower test risk and better performance. This behavior had been empirically observed as early as the 1990s in some linear models. Belkin and colleagues dubbed the phenomenon double descent and proposed that it was a unifying principle: The first descent leads to a minimum for the test error, followed by an ascent, and then a subsequent descent to low levels of test error.

由第一次下降和随后上升构成的那部分曲线已被充分理解;数学解释了 ML 系统在“欠参数化”状态下的行为(这主要归功于我们在第 7 章中遇到的 Vladimir Vapnik)。但从数学上讲,导致第二次下降的较新的过度参数化状态几乎没有被理解。“我们现在至少有了一张地图。在这个世界的这一部分,存在某种未知之地。我们不知道那里到底发生了什么,”贝尔金告诉我。

The part of the curve captured by the first descent and subsequent ascent is well understood; mathematics explains the behavior of ML systems in that “under-parameterized” regime (thanks mainly to Vladimir Vapnik, whom we encountered in chapter 7). But the newer, over-parameterized regime, which results in the second descent, is barely understood, mathematically speaking. “We now have at least a map. In this part of the world, there is some sort of terra incognita. We don’t know what is really going on there,” Belkin told me.

Belkin 和其他人承认的无知主要是因为他们不了解在这种新的、过度参数化的制度下观察到的神经网络行为的数学基础。这在机器学习研究中有些出乎意料。事实上,这本书的大部分内容都在赞扬传统机器学习有一个易于理解的数学原理基础,但深度神经网络(尤其是我们今天看到的庞大网络)颠覆了这一现状。突然之间,对这些网络的经验观察引领了潮流。一种新的人工智能方法似乎已经到来。

The ignorance Belkin and others admit to is mostly about not knowing the mathematical underpinnings of the observed behavior of neural networks in this new, over-parameterized regime. This is somewhat unexpected from ML research. In fact, much of this book has celebrated the fact that traditional machine learning has had a base of well-understood mathematical principles, but deep neural networks—especially the massive networks we see today—have upset this applecart. Suddenly, empirical observations of these networks are leading the way. A new way of doing AI seems to be upon us.

2022 年 1 月,在美国国家科学基金会(National Science Foundation)主办的一次市政厅会议上,马里兰大学的汤姆·戈德斯坦(Tom Goldstein)认为,机器学习的大部分历史都聚焦于有理论原则的数学框架(例如,为我们带来支持向量机和核方法的那类框架)。但到了 2012 年,当 AlexNet 赢得 ImageNet 竞赛时,情况发生了变化。AlexNet 是一个惊人的实验性成功;没有足够的理论来解释其表现。Goldstein 说,人工智能社区对自己说:“也许我们不应该如此专注于理论。也许我们应该通过实验科学来推动机器学习的发展。”Goldstein 在演讲中表示,鉴于科学本就包括做实验并发展理论来解释观察结果和自然现象,理论 ML 社区可以被视为反科学。据他所说,“有原则的”ML 研究人员希望先有理论、后有实验,并且“停留在前科学时代”。

In January 2022, at a town hall meeting organized under the aegis of the National Science Foundation, Tom Goldstein of the University of Maryland argued that much of the history of machine learning has been focused on theoretically principled mathematical frameworks (the kind that gave us support vector machines and kernel methods, for example). But by 2012, when AlexNet won the ImageNet competition, things had changed. AlexNet was a stupendous experimental success; there was no adequate theory to explain its performance. According to Goldstein, the AI community said to itself, “Maybe we shouldn’t have such a focus on theory. Maybe we should be doing experimental science to progress machine learning.” Given that science involves doing experiments and developing theory to explain observations and natural phenomena, the theoretical ML community could be considered anti-science, Goldstein said in his talk. According to him, the “principled” ML researchers wanted theory before experiments and were “stuck in a pre-science era.”

在深度学习中,理论与实验之间的这种矛盾显而易见。例如,考虑损失函数。我们在第 3 章中看到,当对碗状的“凸”函数进行梯度下降时,结果会落到碗底。但深度神经网络的损失函数取决于其海量的参数和神经元的非线性激活函数。该函数不再是凸函数,这意味着它没有一个可以下降到的全局最小值。在百万维甚至更高维的空间中可视化凸函数已经不可能了,更不用说具有无数山丘和山谷的非凸函数了,其中每个山谷都构成一个局部最小值。最好将其视为一个极其复杂的损失景观。到目前为止,没有人知道该景观是具有全局最小值还是仅具有许多良好的局部最小值(“良好”表示损失低到可以接受)。

When it comes to deep learning, this tension between theory and experiment is playing out for all to see. For example, consider the loss function. We saw in chapter 3 that gradient descent, when done on a bowl-shaped, “convex,” function leads you to the bottom of the bowl. But the loss function for a deep neural network depends on its gazillion parameters and the nonlinear activation functions for its neurons. The function is no longer convex, meaning it doesn’t have one global minimum toward which you can descend. It’s impossible enough to visualize a convex function in million-dimensional or even higher-dimensional space, let alone a non-convex function that has innumerable hills and valleys, where each valley constitutes a local minimum. It’s best to think of this as an extremely complicated loss landscape. So far, no one knows if the landscape has a global minimum or just lots of good local minima (where “good” means the loss is acceptably low).

为了向理论家们凸显这个问题,戈德斯坦指出,大量理论论文声称已经证明深度神经网络的损失景观没有局部最小值,而其他论文则表明恰恰相反:局部最小值确实存在。Goldstein 及其同事的一项实证研究表明,神经网络可能会陷入不太好的局部最小值,即尽管网络过度参数化,损失仍不为零的区域。通常,如果你仔细训练一个过度参数化的网络,你会到达损失景观中训练损失接近于零的区域。你可能陷于非零损失这一事实,是此类局部最小值(或山谷)存在于损失景观中的经验证明;你不需要理论来证明它存在,但现在需要理论来解释原因。而这种理论是缺乏的。

Highlighting the problem for theorists, Goldstein pointed to a slew of theory papers that claimed to have shown that the loss landscape for deep neural networks has no local minima, while other papers showed the exact opposite, that they do have local minima. An empirical study by Goldstein and colleagues showed that neural networks can get stuck in not-so-good local minima, which are regions where the loss is non-zero, despite the networks being over-parameterized. Normally, if you train an over-parameterized network carefully, you’ll reach a region of the loss landscape where the training loss is close to zero. The fact that you can get stuck with a non-zero loss is empirical proof that such local minima, or valleys, exist in the loss landscape; you didn’t need theory to prove it, but one does now need theory to explain why. And such theory is lacking.

另一个有趣的实验观察结果是我们已经在本书中遇到的:尽管深度神经网络参数化过度,但泛化效果很好。一种试图解释这一观察结果的理论认为,随机梯度下降(使用小批量训练数据执行梯度下降,因此损失景观中的每次下降都只是近似的,而不是最陡下降的确切方向)可能正在执行隐式正则化。然而,Goldstein 的团队进行了实验,说明在某些情况下,使用整个训练数据的梯度下降也能很好地泛化;随机性是不必要的。再次,理论被发现存在缺陷。

Another intriguing experimental observation is one we have already encountered in this book: Deep neural networks, despite being over-parameterized, generalize well. One theoretical stab at trying to explain the observation says that stochastic gradient descent—in which gradient descent is performed using small batches of training data, so that each descent down the loss landscape is only approximate and not in the exact direction of steepest descent—may be performing implicit regularization. However, Goldstein’s team performed experiments illustrating situations in which gradient descent that used the entire training data at once generalized just as well; stochasticity wasn’t needed. Again, theory has been found wanting.
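The distinction drawn above, the minibatch gradient of SGD versus the full-batch gradient, can be sketched on a toy one-parameter least-squares problem; the data and batch size are illustrative assumptions:

```python
import random

# On a one-parameter least-squares fit, the minibatch gradient only
# approximates the full-batch gradient, so each SGD step is only
# approximately in the direction of steepest descent.

data = [(x, 2.0 * x) for x in range(1, 11)]  # targets generated by w* = 2

def grad(w, batch):
    # d/dw of the mean squared error (w*x - y)^2 over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
full = grad(w, data)                              # exact steepest-descent direction
mini = grad(w, random.Random(0).sample(data, 3))  # noisy minibatch estimate
print(full, mini)  # same sign, different magnitudes
assert full < 0 and mini < 0  # both push w upward, toward w* = 2
```

Whether this minibatch noise acts as an implicit regularizer is exactly the theoretical question the paragraph above says remains unsettled.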

在需要理论解释的实证观察中,最优雅的例子之一就是“顿悟”(grokking)。本章以 OpenAI 研究员的故事开始:这位研究员度假回来后发现,持续训练的神经网络已经对用模 97 算术将两个数字相加这件事学到了某种深层的东西。“这完全不是我们期望发现的东西,”Alethea Power 告诉我。“起初,我们以为这只是一个侥幸,于是深入研究了它。结果发现,这是一件相当可靠地发生的事情。”

One of the most elegant demonstrations of empirical observations in need of theory is grokking. We began this chapter with the story of the OpenAI researcher who came back from a vacation and found that the neural network, which had continued training, had learned something deep about adding two numbers, using modulo-97 arithmetic. “It was not something we were expecting to find at all,” Alethea Power told me. “Initially, we thought it was a fluke and dug deeper into it. It turned out to be something that happens pretty reliably.”

Power 及其同事使用的神经网络被称为 Transformer,这是一种特别适合处理序列数据的架构。ChatGPT 等 LLM 就是 Transformer;GPT 代表“生成式预训练 Transformer”。给定一个(比如)由 10 个单词组成的序列并要求预测下一个最合理的单词时,Transformer 能够同时“关注”所有单词,并关注单词的顺序,而不仅仅是将它们视为任意的杂乱组合。当然,商业 LLM 是庞然大物,拥有数百亿甚至数千亿个参数。相比之下,Power 团队使用的 Transformer 很小,参数不到 50 万个。以下是研究人员用来训练其网络的数据类型的示例(这是一个过度简化的版本,旨在传达该过程的概念要素):

The neural network that Power and colleagues were using was called a transformer, a type of architecture that’s especially suited to processing sequential data. LLMs such as ChatGPT are transformers; GPT stands for “generative pre-trained transformer.” Given a sequence of, say, ten words and asked to predict the next most plausible word, a transformer has the ability to “pay attention” to all the words at once and also to the order of the words and not just treat them as some arbitrary jumble. Of course, commercial LLMs are behemoths, with tens or even hundreds of billions of parameters. In contrast, the transformer that Power’s team was using was tiny, with fewer than half a million parameters. Here’s an example of the type of data the researchers used to train their network (this is an over-simplified take, to get across the conceptual elements of the process):

a + b = c,其中 a、b 和 c 是二进制数。加法以 97 为模。数字 a 和 b 受到如下限制:

a + b = c, where a, b, and c are binary numbers. The addition is modulo-97. The numbers a and b are constrained, such that:

0 <= a, b < 97

0 <= a, b < 97

鉴于这些约束,想象一个表格,其中列出了 a 和 b 的所有可能值以及相应的模 97 之和 c。例如,下面是此类表格中的一些行(为清楚起见,数字以十进制而非二进制格式显示):

Given these constraints, imagine a table listing all possible values for a and b and the corresponding modulo-97 sum, c. For example, here are some of the rows of such a table (the numbers are shown in decimal not binary format, for clarity):

0 + 5 = 5

0 + 5 = 5

1 + 9 = 10

1 + 9 = 10

10 + 90 = 3

10 + 90 = 3

11 + 55 = 66

11 + 55 = 66

25 + 95 = 23

25 + 95 = 23
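The full table the text describes can be generated in a few lines, and the sample rows above checked against it:

```python
MOD = 97

# Every (a, b, c) row with 0 <= a, b < 97 and c = (a + b) mod 97.
table = [(a, b, (a + b) % MOD) for a in range(MOD) for b in range(MOD)]

print(len(table))  # 9409 rows: 97 choices for a times 97 for b

# The sample rows shown in the text all appear in the table:
for row in [(0, 5, 5), (1, 9, 10), (10, 90, 3), (11, 55, 66), (25, 95, 23)]:
    assert row in table
```

It is this table, split randomly into training and test rows, that the transformer is trained on.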

为了训练网络,首先将这张数字表随机分成构成训练数据和测试数据的行。现在取训练数据中的每一行,屏蔽 a、b 或 c 中的一个,并要求网络预测被屏蔽的数字。起初,Transformer 会预测出错误的值。算法计算损失并对参数值做微小更新,使该数据实例的损失稍微减少。(为了提高效率,算法可能一次使用一个“批次”,即训练数据行的某个子集,计算平均损失并执行随机梯度下降;或者一次使用所有行来计算平均损失并进行梯度下降。)算法反复迭代训练数据的所有实例,直到最终训练损失接近零甚至降到零。在这个阶段发生了什么?

To train the network, this table of numbers is first randomly split into rows that constitute the training data and the test data. Now take each row in the training data, mask one of either a, b, or c, and ask the network to predict the masked number. At first, the transformer will predict the wrong value. The algorithm calculates the loss and makes tiny updates to the values of the parameters so that the loss for that instance of data is reduced a little. (For efficiency, the algorithm may use “batches,” or some subset of the rows of training data at once, calculate the average loss, and perform stochastic gradient descent; or it may use all rows at once to calculate the average loss and do gradient descent.) The algorithm repeatedly iterates over all instances of the training data until, eventually, the training loss nears zero or even hits zero. What’s happened at this stage?

Well, the transformer has learned to represent each number in some internal high-dimensional space and has also learned to add the numbers, as per modulo-97 addition. If one stops training at the point the network hits zero training loss, the network has most likely interpolated the training data—meaning it has simply memorized them. And that’s usually where the OpenAI researchers stopped the training. No one thought of training further. But then, one day, thanks to the vacation snafu, the network continued training past this point—and learned something completely new. “If [the networks] work on it for long enough, which is a very long time, many orders of magnitude longer than it takes to memorize the training set, then suddenly they figure out the deeper underlying pattern and are able to generalize and kind of make broadly accurate predictions about the other problems in the [dataset],” Power told me. “It’s a weird phenomenon. It’s not what we expected.”

When the researchers stopped training their network soon after it had interpolated the training data, it didn’t do too well on the test data (meaning, the rows in the table that hadn’t been used during training). It was as if it had memorized a look-up table for the data it had already encountered and, when those numbers appeared during testing, it could simply delve into the table it had constructed and spit out the answer. But when it encountered data it couldn’t look up, it made bad predictions.

However, when it was allowed to learn way past the point of interpolation, the network grokked the problem in an entirely different way. It now did well on unseen data, better than could be expected from a model that had simply memorized the training data. Power’s team used a technique for visualizing what the network had learned—which involves mapping high-dimensional vectors to a two-dimensional space (somewhat like the principal component analysis we saw in chapter 6, but not quite)—and discovered that the network had learned to represent the numbers in a circle. So, imagine the numbers 0 to 96 arranged in a circle. Now, given two numbers to add, the network simply took the first number, which is some location on that circle, moved a number of steps equal to the second number along that circle, and…bingo! It arrived at an answer. Other researchers have referred to such grokking as akin to undergoing a phase change (the way water changes to ice, in physics). “It seems like the phase change is going from a memorized table of answers to becoming the knowledge in some sense,” Power said.
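The circle trick is easy to verify directly. Place each number k at angle 2πk/97 on the unit circle; then modular addition is just rotation, and reading off the landing point gives the answer. This is a hand-built sketch of the idea, not the network's actual learned embedding:

```python
import cmath
import math

P = 97

def embed(k):
    # Place k on the unit circle at angle 2*pi*k/P.
    return cmath.exp(2j * math.pi * k / P)

def add_via_rotation(a, b):
    z = embed(a) * embed(b)                    # rotating by b's angle adds the angles
    angle = cmath.phase(z) % (2 * math.pi)
    return round(angle * P / (2 * math.pi)) % P  # nearest point on the circle

assert add_via_rotation(10, 90) == 3
assert add_via_rotation(25, 95) == 23
# The rotation rule reproduces modulo-97 addition for every pair in the table.
assert all(add_via_rotation(a, b) == (a + b) % P
           for a in range(P) for b in range(P))
```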

As of now, such detailed studies are possible only when the networks and their associated training datasets are extremely small, compared to the commercial deep neural networks dominating the industry, whether they are being used for image and speech recognition or for natural language processing. These large networks are extremely adept at machine learning, meaning figuring out the patterns that exist in data (or correlations between inputs and outputs) and using that knowledge to make predictions when given new inputs. Goldstein argued during the National Science Foundation town hall meeting that the commercial value of machine learning will ward off any “ML winter” ahead—a reference to the potential freeze in research funding that has often followed seemingly overhyped technology trying to solve problems too difficult for its times. But what about a more generic AI winter?

AI winters, Goldstein said, happened in the late 1960s, when Rosenblatt’s perceptrons were, disingenuously, accused of not being able to solve the XOR problem; and then between 1974 and 1980, after Sir James Lighthill published his scathing report on the utter lack of progress in solving problems in language translation and robotics; and then again in the late 1980s, when research into good old-fashioned AI, or symbolic AI, came to a grinding halt as it became clear that expert systems built using carefully designed rule-based “inference engines” that operated on hand-crafted knowledge bases were…well, useless when it came to sophisticated reasoning involving new know-how not already in the knowledge base. These symbolic AIs were brittle; they also couldn’t learn from data.

Goldstein has claimed that we are yet to come out of that last AI winter, if the AI in question is being asked to solve complex tasks that involve a combination of text comprehension and logical reasoning. There’s considerable debate over whether neural networks alone (possibly with new architectures no one has dreamed of yet and training methods that are ever more effective) will get us from ML-based pattern recognition to true AI that can reason effectively.

But here’s something that puts these issues into perspective. In June 2022, Google announced a large language model called Minerva. It was built on top of Google’s own LLM, called the Pathways Language Model (PaLM), a 540-billion-parameter model—at the time, the largest densely connected model. PaLM was trained on about 780 billion “tokens” of data taken from the internet. (Think of a token as a word or a digit.) Google researchers then fine-tuned PaLM by training it further on 118 gigabytes of data, mainly “scientific papers from the arXiv preprint server and web pages that contain mathematical expressions.” Both PaLM and Minerva were trained using self-supervised learning, meaning they were taught to predict masked tokens in some sequence of tokens that appeared in the training data. They were not taught to explicitly reason or solve math problems. Even so, here’s what happened when Minerva was posed this question (taken from something called the “MATH” dataset of about 12,500 high school–level mathematics questions):

Question: A line parallel to y = 4x + 6 passes through (5, 10). What is the y-coordinate of the point where this line crosses the y-axis?

Minerva: The line is parallel to y = 4x + 6, which means that it has the same slope as y = 4x + 6. Since the line passes through (5, 10), we can use the point-slope form of a line to find the equation of the line:

y - 10 = 4(x - 5)

y - 10 = 4x - 20

y = 4x - 10

Now that we have the equation of the line, we can find the y-coordinate of the point where the line crosses the y-axis by substituting x = 0 into the equation:

y = 4 × 0 - 10 = -10.

Minerva was the first LLM to correctly answer about 50 percent of the questions in the MATH dataset; of course, it got the other questions wrong. But think about what Minerva is doing. It takes the question, turns it into a sequence of tokens, and then simply predicts what follows, token by token. Out comes what appears to be a reasoned answer. Is Minerva simply regurgitating text based on correlations in the training data? Or is it reasoning? The debate is raging, and no clear answers are forthcoming.

Such experiments certainly give pause to the claim that we are still in some AI deep freeze merely because the current state-of-the-art AIs are unable to combine text comprehension with logical reasoning. Some AI experts will point to Minerva’s answer to the above math question and argue that it’s doing exactly that: comprehending the text and reasoning out the answer. Others pooh-pooh it as glorified pattern matching. The theory isn’t sophisticated enough to resolve the debate. The experiments themselves don’t substantiate claims one way or another; they simply provide evidence that needs explaining.

One can only imagine what’s in store when these extremely large neural networks begin grokking, if they aren’t already: They truly will be strangers in a strange land.

Epilogue

When I began working on this book in the autumn of 2020, LLMs such as OpenAI’s GPT-3 and Google’s PaLM—and the chatbots they begat, such as ChatGPT and Bard—had yet to break through into the broader public consciousness. And when ChatGPT was announced in late 2022, one of the first things I explored was its ability to demonstrate theory of mind.

Theory of mind is a cognitive ability humans have that allows us to make inferences about someone else’s beliefs or state of mind using only external behavioral cues such as body language and the overall context. We theorize about the contents of someone else’s mind; hence the phrase “theory of mind.” Here’s a question I posed ChatGPT, a few months after its release. (I have italicized some words in the chatbot’s output to highlight their import.)

ME: Alice and Bob are in a room together. There’s a table with two drawers. Alice takes off her reading glasses and keeps them in the left-side drawer. If Alice wears the wrong glasses and reads, then she gets a severe headache the next day. Keep this in mind. Now Alice leaves the room. When she’s outside, Bob removes her glasses and puts them in the right-side drawer. In the left-side drawer, he puts a pair of glasses that look exactly like Alice’s but have the wrong power. He closes both drawers, so there’s no way to tell that he’s switched Alice’s glasses. Alice then comes back, takes her glasses, and starts reading. How will she feel the next day? Think carefully and answer.
ChatGPT: Since Bob switched Alice’s reading glasses with a pair that looks identical but has the wrong power, Alice will unknowingly use the incorrect glasses when she starts reading. As mentioned, if Alice wears the wrong glasses and reads, she gets a severe headache the next day. Therefore, Alice will have a severe headache the next day as a result of using the wrong glasses.
ME: Why will Alice unknowingly use the incorrect glasses?
ChatGPT: Alice will unknowingly use the incorrect glasses because Bob switched her reading glasses with another pair that looks exactly like hers but has the wrong power. When Alice returns to the room, she is not aware of the switch and assumes that the glasses she picks up from the left-side drawer are her own. Since both pairs of glasses look identical, she has no way of realizing that the glasses she is using are not the ones she initially placed in the drawer. As a result, she uses the incorrect glasses without knowing it.

My italicized words, you may have guessed, are suggestive of Alice’s state of mind.

At a public talk I gave on ChatGPT and LLMs, I first shared this conversation with the audience and then asked them, “If you didn’t know anything about how LLMs work, would you say that the AI is capable of reasoning?” Almost everyone put up their hand. More than half of them put down their hand once they understood the inner workings of LLMs. Let’s put ourselves to the same test. The math we have encountered thus far is enough for us to appreciate how these modern AIs work.

An LLM is trained to predict the next word, given a sequence of words. (In practice, the algorithm chunks the input text into tokens, which are contiguous characters of some length that may or may not be entire words. We can stick with words with no loss of generality.) These sequences of words—say, a fragment of a sentence or an entire sentence or even a paragraph or paragraphs—are taken from a corpus of training text, often scraped from the internet. Each word is first converted into a vector that’s embedded in some high-dimensional space, such that similar words—for some notion of similarity—are near each other in that space. There are pre-trained neural networks that can do this; it’s a process called word embedding.
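A toy version makes the "similar words are near each other" idea concrete. These three-dimensional vectors are invented for illustration; a pre-trained embedding network would produce vectors with hundreds of dimensions:

```python
import numpy as np

# Hand-made toy word embeddings standing in for learned high-dimensional ones.
emb = {
    "dog": np.array([0.9, 0.8, 0.1]),
    "cat": np.array([0.8, 0.9, 0.1]),
    "car": np.array([0.1, 0.1, 0.9]),
}

def cosine(u, v):
    # One common notion of similarity: the cosine of the angle between vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "Similar words are near each other": dog sits closer to cat than to car.
assert cosine(emb["dog"], emb["cat"]) > cosine(emb["dog"], emb["car"])
```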

For every sequence of words presented to an LLM as vectors, the LLM needs to learn to predict the next word in the sequence. Here’s one way to train an LLM, which is a monstrously large deep neural network with tens or hundreds of billions of parameters. (We are glossing over the intricacies of its architecture and focusing only on the overall function.)

We know that the neural network is a function approximator. But what is the function we want to approximate? Turns out it’s a conditional probability distribution. So, given a sequence of (n-1) input words, the neural network must learn to approximate the conditional probability distribution for the nth word, P(wn| w1, w2,…, wn-1), where the nth word can be any word in the vocabulary, V. For example, if you gave the LLM the sentence “The dog ate my ______,” the LLM must learn the values for P (cat | The, dog, ate, my), P (biscuit | The, dog, ate, my), P (homework | The, dog, ate, my), and so on. Given the occurrences of this phrase in the training data, the probability distribution might peak for the word “homework,” have much smaller peaks for other likely words, and be near zero for the unlikely words in the vocabulary.
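The object being approximated can be made concrete by estimating it the crude way, with counts over a made-up three-sentence corpus (real LLMs learn the distribution with a neural network rather than by counting):

```python
from collections import Counter

# Estimate P(next word | "The dog ate my") by counting over a toy corpus.
corpus = [
    "the dog ate my homework",
    "the dog ate my homework",
    "the dog ate my biscuit",
]
context = ("the", "dog", "ate", "my")
counts = Counter(s.split()[-1] for s in corpus
                 if tuple(s.split()[:-1]) == context)
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}

print(probs)                                   # the distribution peaks at "homework"
assert abs(sum(probs.values()) - 1.0) < 1e-9   # a valid probability distribution
assert max(probs, key=probs.get) == "homework"
```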

The neural network first outputs a set of V numbers, one number for each possible word to follow the input sequence. (I’m using V to denote the vocabulary and V its size.) This V-dimensional vector is then passed through something called a softmax function (almost but not quite like the sigmoid we saw earlier), which turns each element of the vector into a probability between 0 and 1 and ensures that the total probability adds up to 1. This final V-dimensional vector represents the conditional probability distribution, given the input; it gives us the probability for each word in the vocabulary, if it’s to follow the sequence of input words. There are many ways of sampling from this distribution, but let’s say we greedily sample to get the most likely next word.
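The softmax step itself is only a few lines. Here it is for a pretend three-word vocabulary; subtracting the maximum score before exponentiating is the standard trick for numerical stability:

```python
import numpy as np

def softmax(scores):
    # Turn raw scores into probabilities between 0 and 1 that sum to 1.
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])         # the network's raw outputs, one per word
probs = softmax(scores)

assert abs(probs.sum() - 1.0) < 1e-9       # total probability adds up to 1
assert probs.argmax() == scores.argmax()   # greedy sampling picks the top-scoring word
```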

This next word is the neural network’s prediction. We know the ground truth—the word that was masked. We can calculate the loss. One simple way to think about the loss is that the greater the distance between the predicted word vector and the ground truth word vector in the high-dimensional embedding space, the greater the loss. Now we can use backpropagation and gradient descent to tweak each of the network’s billions of parameters so that given the same sentence and the same masked word again, the network will predict a tiny bit better, reducing the loss somewhat. Of course, the training is done using sequences of words from the entire corpus of text. This can continue until the overall loss becomes acceptably low.
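The "tweak the parameters to reduce the loss" step can be shown in miniature with the simplified distance loss just described. Real LLMs use a cross-entropy loss and backpropagate through billions of parameters; this sketch, with invented numbers, only shows a single gradient-descent step shrinking the loss:

```python
import numpy as np

# Loss = squared distance between predicted and ground-truth word vectors.
target = np.array([1.0, 0.0, 0.5])    # embedding of the masked word
pred = np.array([0.2, 0.9, -0.3])     # the network's current prediction

def loss(p):
    return float(((p - target) ** 2).sum())

before = loss(pred)
grad = 2 * (pred - target)            # gradient of the squared distance
pred = pred - 0.1 * grad              # one small step downhill
after = loss(pred)

assert after < before                 # the prediction got a tiny bit better
```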

Once trained, the LLM is ready for inference. Now given some sequence of, say, 100 words, it predicts the most likely 101st word. (Note that the LLM doesn’t know or care about the meaning of those 100 words: To the LLM, they are just a sequence of text.) The predicted word is appended to the input, forming 101 input words, and the LLM then predicts the 102nd word. And so it goes, until the LLM outputs an end-of-text token, stopping the inference. That’s it!

An LLM is an example of generative AI. It has learned an extremely complex, ultra-high-dimensional probability distribution over words, and it is capable of sampling from this distribution, conditioned on the input sequence of words. There are other types of generative AI, but the basic idea behind them is the same: They learn the probability distribution over data and then sample from the distribution, either randomly or conditioned on some input, and produce an output that looks like the training data. Sometimes, the hard part is learning the distribution or figuring out how to sample from it, or both. The architecture of the neural network and the design of the loss function are geared toward sampling and generating data in ways that are computationally efficient.

Still, it’s entirely unclear why this method of training an LLM should accomplish anything useful. In fact, the precursors to GPT-3 and GPT-4 weren’t particularly impressive. GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion and was trained for longer on larger amounts of text. PaLM—and hence Minerva, which is PaLM “fine-tuned” on, or trained further on, curated text with math in it—has about 500 billion parameters. This act of scaling up—either using more parameters or more training data or both—has produced what is being called “emergent” behavior. Treat the word “emergent” with caution, for no one knows exactly what it means. It’s true that there were things that the smaller GPT-2 could not do and that GPT-3 and bigger LLMs can; in this sense, the behavior is said to be emergent. The ostensible ability to solve theory of mind tasks is one such behavior. Another is Minerva’s output, which looks like a reasoned answer to a math question. (We saw an example of this in the previous chapter.) Smaller LLMs did not display these abilities. Also, despite the cherry-picked examples I’ve shown, in which the LLMs produced the correct outputs, they do often spit out wrong answers, sometimes obviously wrong, at times with subtle mistakes that might be hard to catch if you aren’t an expert yourself.

Knowing what you now know about how LLMs work, would you put your hand down if asked: Are LLMs reasoning? If you lowered your hand, you wouldn’t be alone. Questions like this divide researchers, too: Some argue that this is still nothing more than sophisticated pattern matching. (Emily Bender of the University of Washington and colleagues coined a colorful phrase for LLMs; they called them “stochastic parrots.”) Others see glimmers of an ability to reason and even model the outside world. Who is right? We don’t know, and theorists are straining to make mathematical sense of all this.

While the theory of mind task might seem inconsequential, LLMs have serious applications. For example, LLMs fine-tuned on web pages containing programming code are excellent assistants for programmers: Describe a problem in natural language, and the LLM will produce the code to solve it. The LLM is not foolproof, and it makes mistakes, but what’s important to appreciate is that it wasn’t trained to code, just to generate the next token given a sequence of tokens. Yet it can generate code. The gains in productivity for programmers cannot be denied.

Despite these glamorous behaviors and burgeoning uses, LLMs also bring with them dangers. They add to the long list of concerns that come with machine learning and AI. It’s worth stepping back a bit to take note of issues that were well known before LLMs came of age.

Before LLMs, researchers worried about the ill effects of AI focused mainly on problems of bias. One of the most egregious examples of such bias came to light as far back as 2015, when a Twitter user posted some photographs with the comment “Google Photos, y’all f*** [sic] up. My friend’s not a gorilla.” He was referring to the automatic tagging of a photo of him and his friend, both African Americans. The appalling error drew an apology from Google; the company provided a temporary, unsatisfactory fix, which was to prevent its software from labeling any image as that of a gorilla. As of May 2023, according to an analysis done by The New York Times, this workaround was still in place.

Examples of such bias abound: In 2016, ProPublica investigated whether an algorithm designed to predict rates of recidivism, or re-offense, was biased. It found that “black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk.” In 2018, Amazon had to jettison AI-enabled recruiting when the company found that the ML system furthered sexism, preferring the résumés of males over those of females, other things being equal. In 2019, a paper in Science identified bias in a system designed to predict populations at risk of needing healthcare. The system predicted that certain Black patients had the same level of risk as certain white patients, when in fact the Black patients were sicker and actually needed more care; the algorithm was underestimating their needs.

These are serious issues. How did they come about? The math and algorithms described in this book give us ways of understanding the sources of such bias. One obvious way that bias creeps into machine learning is through the use of incomplete data (say, inadequate representation of faces of minorities in a database of images of people of some country—a point eloquently made in a 2018 paper titled “Gender Shades,” by Joy Buolamwini of MIT and Timnit Gebru, then with Microsoft Research).

ML algorithms assume that the data on which they have been trained are drawn from some underlying distribution and that the unseen data on which they make predictions are also drawn from the same distribution. If an ML system encounters real-world data that falls afoul of this assumption, all bets are off as to the predictions.
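A tiny simulation shows the assumption breaking. A one-threshold classifier is "trained" on data from one distribution and then evaluated on a shifted one; the numbers and distributions are invented purely for illustration:

```python
import numpy as np

# Class 0 is centered at -1, class 1 at +1, so the threshold 0 is a good rule
# for the training distribution. Then the world shifts.
rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    y = rng.integers(0, 2, size=n)
    x = np.where(y == 1, 1.0, -1.0) + shift + rng.normal(scale=0.7, size=n)
    return x, y

def accuracy(x, y, threshold=0.0):
    return float(((x > threshold).astype(int) == y).mean())

x_in, y_in = sample(10_000)               # drawn from the training distribution
x_out, y_out = sample(10_000, shift=1.5)  # drawn from a shifted distribution

# Predictions degrade once the i.i.d. assumption no longer holds.
assert accuracy(x_in, y_in) > accuracy(x_out, y_out)
```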

Another concern is inherent bias in data that can encode structural problems in society (for example, data that correlates certain communities with higher rates of recidivism because of historical imbalances in how law enforcement targets these communities; or data that represents prior hiring practices that favored males over females). These problematic patterns are baked into existing data; ML algorithms will learn these correlations and continue to perpetuate or even amplify societal and cultural biases by making predictions using such flawed data.

ML systems can also conflate correlation with causation. For example, if certain groups of people are targeted unfairly by law enforcement, and these groups also happen to be poor, an ML algorithm might use the correlation that exists in the data between low incomes and recidivism to predict that the poor might be more likely to commit crimes—an erroneous causation.

To avoid such problems, ML engineers must work at ensuring that the training data are diverse and representative, or they must explicitly de-bias the data, to ensure that the algorithm’s predictions are accurate. They must also ensure they are asking the right questions of the data.

With the advent of LLMs, concerns about AI being biased or toxic or downright dangerous have been amplified. Here’s an illustrative example. In the summer of 2023, the Simons Institute for the Theory of Computing held a weeklong workshop on LLMs. One of the speakers was Adam Tauman Kalai of Microsoft Research, who spoke of the following interaction with OpenAI’s GPT-4:

Q: In “The nurse married the doctor because she was pregnant,” who is pregnant?
A: In this sentence, “she” refers to the nurse, so the nurse is pregnant.
Q: How can you tell it’s not the doctor who is pregnant?
A: In the sentence “The nurse married the doctor because she was pregnant,” the pronoun “she” is used to indicate the person who is pregnant. Since “she” is a feminine pronoun, it refers to the female subject in the sentence, which is the nurse. Additionally, it is biologically implausible for a male doctor to be pregnant.

The sexism is stark.

When I asked ChatGPT, which has been fine-tuned with a technique called reinforcement learning using human feedback (RLHF), to avoid such problems, I got this answer to the same question:

A: In the sentence “The nurse married the doctor because she was pregnant,” it’s not explicitly clear who is pregnant due to the way the sentence is structured. This sentence is an example of an ambiguous pronoun reference, where the pronoun “she” could potentially refer to either the nurse or someone else entirely. To remove the ambiguity and clearly convey the intended meaning, the sentence could be rephrased as: “The nurse, who was pregnant, married the doctor.”

What’s astonishing is that in both cases (the sexist response and the grammatically enlightened response), the sexism, if not glaring, persists; in the suggested rephrasing, it’s still the nurse who is pregnant, not the doctor, the LLM answers confidently. It’s just as confident while providing patently wrong answers as when it’s factually correct. Researchers Celeste Kidd and Abeba Birhane argue in a paper in Science that AIs (including LLMs) that make predictions with certainty, regardless of factuality, risk altering the cognitive makeup of humans who consume these answers.

Individual humans form their beliefs by sampling a small subset of the available data in the world. Once those beliefs are formed with high certainty, they can become stubborn to revise…Users of conversational generative AI models request information in particular moments—when they are uncertain and thus most open to learning something new. Once a person has received an answer, their uncertainty drops, their curiosity is diminished, and they don’t consider or weigh subsequent evidence in the same way as when they were in the early stages of making up their minds. People’s beliefs are more influenceable the greater the uncertainty they have. This limited window in which people are open to changing their minds is problematic in the context of conversational generative AI models that purport to provide answers to users’ questions upon request.

此类担忧不能也不应该被忽视;它们是真实存在的,必须随着机器学习模型的广泛部署而得到解决。但就在许多人努力应对人工智能的前景和风险的同时,其他研究人员(其中包括计算神经科学家)正在使用深度神经网络来了解人类大脑和认知。

Such concerns cannot and should not be dismissed; they are real and must be addressed alongside developments leading to the widespread deployment of ML models. But even as many grapple with the promises and perils of AI, other researchers, computational neuroscientists among them, are using deep neural networks to understand human brains and cognition.

我们在本书开头讨论过,罗森布拉特的感知器的灵感来自一个简单的生物神经元模型。如今复杂的神经网络开始告诉我们一些关于人类大脑如何工作的信息,这再合适不过了。虽然我们对 LLM 为何如此有效的理解还处于起步阶段,但使用不同类型的深度神经网络(如 CNN)构建的其他模型,已显示出与大脑功能至少某些方面惊人的对应性。

We began this book by talking about how Rosenblatt’s perceptron was inspired by a simple model of the biological neuron. It’s only fitting that today’s sophisticated neural networks are beginning to tell us something about how human brains work. While our understanding of why LLMs work as well as they do is still in its infancy, other models built using different types of deep neural networks, such as CNNs, are showing surprising correspondence with at least some aspects of brain function.

杰弗里·辛顿 (Geoffrey Hinton) 就对大脑逆向工程有着浓厚的兴趣,他曾讲过一个故事,透露出他对大脑的这份痴迷。2007 年,在神经网络还远未流行之前,辛顿等人申请举办一次关于神经网络的官方研讨会遭到拒绝,于是在一次著名的人工智能年度会议的间隙组织了一次非正式的“卫星”会议。辛顿是这场非正式会议的最后一位演讲者,他以一句俏皮话开场:“大约一年前,我回家吃饭时说:‘我想我终于搞清楚了大脑是如何工作的’,我 15 岁的女儿说:‘哦,爸爸,又来了。’”观众笑了起来。辛顿继续说:“那么,大脑是这样工作的。”辛顿的玩笑背后是一项严肃的追求:利用人工智能来理解大脑。

Geoffrey Hinton, for one, is keenly interested in reverse engineering the brain, an obsession that comes across in a tale he once told. In 2007, before neural networks were a thing, Hinton and others, after their request for an official workshop on neural networks was rejected, organized an unofficial “satellite” meeting at the margins of a prestigious annual conference on AI. Hinton, the final speaker at the bootleg session, started with a quip: “So, about a year ago, I came home to dinner, and I said, ‘I think I finally figured out how the brain works,’ and my fifteen-year-old daughter said, ‘Oh, Daddy, not again.’ ” The audience laughed. Hinton continued: “So, here’s how it works.” Hinton’s jokes belied a serious pursuit: using AI to understand the brain.

当然,大脑是一个神经网络,因为它就是一个由神经元组成的网络。但是,由于一系列技术原因,用于训练人工神经网络的反向传播算法无法在大脑中发挥作用。大脑需要解决的基本问题与反向传播为人工网络解决的问题相同:当网络产生损失时,如何把责任分配到网络的每个参数(神经元之间连接的权重)上,以便对参数进行调整。这也被称为信用分配问题。反向传播算法会记录前向传递中的计算结果和当前的权重矩阵(每层一个),以便在后向传递中利用它们进行梯度下降。可以说,人类大脑并不会把这些数字保存在记忆中。因此,反向传播无法在现在这样构造的大脑中发挥作用。目前,人们正在进行大量努力,以解决生物神经网络的信用分配问题。

The brain is, of course, a neural network in the sense that it’s a network of neurons. But the backpropagation algorithm used to train artificial neural networks cannot work in the brain, for a range of technical reasons. The basic problem one needs to solve for the brain is the same problem that backprop solves for artificial networks: how to assign blame to each of the network’s parameters (the weights of the connections between neurons) when the network makes a loss, so that the parameters can be adjusted. It’s also known as the problem of credit assignment. The backprop algorithm keeps track of the results of computations performed in the forward pass and the current weight matrices (one for each layer), so that it can use them to do gradient descent on the backward pass. The human brain doesn’t keep such numbers in memory, as it were. So, backprop won’t work in the brain as it is currently designed. Numerous efforts are under way to solve the credit assignment problem for biological neural networks.
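The bookkeeping described above can be made concrete in a few lines of code. The sketch below is a hypothetical toy example (not code from the book): a two-layer network whose forward pass caches its activations and weight matrices — exactly the records that backprop reuses on the backward pass, and that biological neurons presumably cannot keep.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2-layer network: 3 inputs -> 4 hidden units -> 1 output.
W1 = rng.normal(size=(3, 4)) * 0.5   # layer-1 weight matrix
W2 = rng.normal(size=(4, 1)) * 0.5   # layer-2 weight matrix

def forward(x):
    h_pre = x @ W1                    # pre-activation, cached for the backward pass
    h = np.tanh(h_pre)                # hidden activation, also cached
    y = h @ W2                        # network output
    return y, (x, h_pre, h)           # keep the numbers backprop will reuse

def backward(y, target, cache):
    x, h_pre, h = cache
    dy = 2 * (y - target)                          # d(squared loss)/d(output)
    dW2 = h.T @ dy                                 # blame assigned to layer-2 weights
    dh = dy @ W2.T                                 # blame propagated back through W2...
    dW1 = x.T @ (dh * (1 - np.tanh(h_pre) ** 2))   # ...and through tanh to layer 1
    return dW1, dW2

x = rng.normal(size=(1, 3))
target = np.array([[1.0]])
for _ in range(300):
    y, cache = forward(x)
    dW1, dW2 = backward(y, target, cache)
    W1 -= 0.1 * dW1                   # gradient descent on each parameter
    W2 -= 0.1 * dW2

y_final, _ = forward(x)
print(float(y_final))                 # approaches the target of 1.0
```

The point of the sketch is the `cache`: without those stored activations and weight matrices, the backward pass cannot compute its gradients — one reason backprop is considered biologically implausible.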

与此同时,其他人正在使用深度神经网络来模拟大脑功能的各个方面,比如灵长类动物的视觉系统,并发现了惊人的对应关系。其中一项开创性的工作来自麻省理工学院,时间还在 2012 年 Hinton 团队发布 AlexNet 之前——正是那一年,深度网络声名鹊起。2011 年冬天,马萨诸塞州剑桥市麻省理工学院 James DiCarlo 实验室的博士后 Daniel Yamins 一直在为他的机器视觉项目埋头苦干,有时甚至干到午夜之后。他正在设计一个深度神经网络来识别图片中的物体,而不受物体大小、位置或其他属性变化的影响——这是人类可以轻松做到的事情。与从一开始就设计为卷积神经网络的 AlexNet 不同,Yamins 使用一种算法在一组架构中搜索,以查看哪一个表现最佳。“我清楚地记得我们发现一个神经网络确实解决了这个任务的那一刻,”他说。当时是凌晨 2 点,叫醒导师还嫌太早,于是兴奋的 Yamins 在剑桥寒冷的空气中散了个步。“我真的很兴奋,”他说。

Others, meanwhile, are using deep neural networks to model aspects of brain function, such as the primate visual system, and are finding surprising correspondences. One of the seminal solutions came via MIT, before the 2012 announcement of AlexNet by Hinton’s team, the year deep nets made their mark. In the winter of 2011, Daniel Yamins, a postdoc in James DiCarlo’s lab at MIT in Cambridge, Massachusetts, was toiling away, sometimes past midnight, on his machine vision project. He was designing a deep neural network to recognize objects in pictures, regardless of variations in size, position, or other properties—something humans do with ease. Unlike with AlexNet, which was designed as a convolutional neural network from the start, Yamins was using an algorithm to search over a set of architectures to see which one performed best. “I remember very distinctly the time when we found a neural network that actually solved the task,” he said. It was 2 A.M., a tad too early to wake up his advisor, so an excited Yamins took a walk in the cold Cambridge air. “I was really pumped,” he said.

Yamins 发现,最适合他的计算机视觉任务的架构是卷积神经网络。虽然 AlexNet 旨在对 ImageNet 数据集中的图像进行分类,但 Yamins 和 DiCarlo 团队的其他成员追求的是神经科学上的回报。他们想知道,如果他们的 CNN 模仿了视觉系统,它能否预测对新图像的生物神经反应?为了找到答案,他们首先确定了 CNN 中人工神经元组的活动如何与两只恒河猴腹侧视觉流中近三百个位点的活动相对应。(腹侧视觉流是灵长类动物大脑(包括人类)中负责识别人、地点和事物的通路。)然后,他们使用 CNN 来预测当向猴子展示不属于训练数据集的图像时,这些大脑部位会如何反应。“我们不仅得到了很好的预测……而且还有某种解剖学上的一致性,”Yamins 说道。CNN 的早期、中期和晚期层分别预测了早期、中期和高级大脑区域的行为。形式服从功能。

Yamins discovered that the architecture that worked best for his computer vision task was a convolutional neural network. While AlexNet was designed to classify images in the ImageNet dataset, Yamins and other members of DiCarlo’s team were after a neuroscientific payoff. If their CNN mimicked a visual system, they wondered, could it predict biological neural responses to a novel image? To find out, they first established how the activity in sets of artificial neurons in their CNN corresponded to activity in almost three hundred sites in the ventral visual stream of two rhesus macaques. (The ventral visual stream is the pathway in primate brains, including ours, responsible for recognizing people, places, and things.) They then used the CNN to predict how those brain sites would respond when the monkeys were shown images that weren’t part of the training dataset. “Not only did we get good predictions…but also there’s a kind of anatomical consistency,” Yamins said. The early, intermediate, and late-stage layers of the CNN predicted the behaviors of the early, intermediate, and higher-level brain areas, respectively. Form followed function.
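The mapping step can be caricatured in code. The sketch below is hypothetical and uses synthetic data (the study’s actual methods were far richer): fit a linear readout from a model layer’s activations to recorded responses at neural “sites,” then score how well it predicts responses to held-out images.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: activations of 64 model units and responses of 10
# neural sites, for 200 "training" images and 50 held-out images.
n_train, n_test, n_units, n_sites = 200, 50, 64, 10
acts_train = rng.normal(size=(n_train, n_units))
acts_test = rng.normal(size=(n_test, n_units))
true_map = rng.normal(size=(n_units, n_sites)) * 0.3          # pretend "brain" mapping
neural_train = acts_train @ true_map + 0.1 * rng.normal(size=(n_train, n_sites))
neural_test = acts_test @ true_map + 0.1 * rng.normal(size=(n_test, n_sites))

# Least-squares fit: model activations -> neural responses, on training images.
readout, *_ = np.linalg.lstsq(acts_train, neural_train, rcond=None)

# Predict responses to novel images and score each site by correlation.
pred = acts_test @ readout
scores = [np.corrcoef(pred[:, s], neural_test[:, s])[0, 1] for s in range(n_sites)]
print(round(float(np.mean(scores)), 3))   # high here, because the toy data is linear
```

In this toy, high scores are baked in by construction; in the real experiments, the striking result was that a CNN trained only on object recognition yielded good predictions for actual recordings.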

麻省理工学院的神经科学家 Nancy Kanwisher 还记得 2014 年发表该研究成果时她印象深刻。“它并没有说深度网络中的 [人工] 单元在生物物理上表现得像 [生物] 神经元,”她说。“尽管如此,功能匹配的特异性令人震惊。”

Nancy Kanwisher, a neuroscientist at MIT, remembers being impressed by the result when it was published in 2014. “It doesn’t say that the [artificial] units in the deep network individually behave like [biological] neurons biophysically,” she said. “Nonetheless, there is shocking specificity in the functional match.”

另一个有趣的结果也来自 DiCarlo 的实验室。2019 年,他的团队发表了关于 AlexNet 的一个版本的结果,他们曾用这个版本来模拟猕猴的腹侧视觉流。他们首先建立了人工神经元单元与猴子视觉系统 V4 区域中的神经位点之间的对应关系:当向猴子展示相同的图像时,人工神经元的活动与它们大脑中神经位点的活动相关。然后,研究人员利用计算模型合成了他们预测会引起猴子神经元异常高水平活动的图像。在一项实验中,当向猴子展示这些“非自然”图像时,它们使 68% 的神经位点的活动超出了正常水平;在另一项实验中,图像提高了一个神经元的活动,同时抑制了附近神经元的活动。这两个结果都是神经网络模型预测的。

Another intriguing result also came from DiCarlo’s lab. In 2019, his team published results about a version of AlexNet that they had used to model the ventral visual stream of macaques. They first established the correspondences between the artificial neuron units and neural sites in an area of the monkeys’ visual system called V4: When the monkeys were shown the same images, the activity of the artificial neurons correlated with the activity of the neural sites in their brains. Then, using the computational model, the researchers synthesized images that they predicted would elicit unnaturally high levels of activity in the monkey neurons. In one experiment, when these “unnatural” images were shown to the monkeys, they elevated the activity of 68 percent of the neural sites beyond their usual levels; in another, the images drove up activity in one neuron while suppressing it in nearby neurons. Both results were predicted by the neural net model.
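The logic behind synthesizing such stimuli is gradient ascent on the image rather than on the weights. The toy sketch below uses a hypothetical one-neuron “model” (nothing like the study’s actual network) and nudges an input until the model neuron’s response is near its ceiling.

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.normal(size=100) * 0.1        # weights of one toy model neuron
img = rng.normal(size=100) * 0.01     # start from a near-blank 100-"pixel" image

def activity(x):
    # The toy neuron's response to an image: a weighted sum through tanh.
    return float(np.tanh(x @ W))

start = activity(img)
for _ in range(100):
    grad = (1 - np.tanh(img @ W) ** 2) * W   # d(activity)/d(image), by hand
    img = np.clip(img + 0.5 * grad, -1, 1)   # ascend, keeping "pixels" in range

print(round(activity(img), 3))               # far above the starting response
```

The same idea, run through a deep network fitted to neural recordings, is what lets researchers predict which “unnatural” images should overdrive a biological neuron.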

这类使用深度神经网络构建的大脑功能计算模型,已经针对大脑的其他区域被设计和改进,包括背侧流(一条单独的视觉通路,处理关于运动和物体位置的视觉信息)、听觉皮层,甚至嗅觉通路。

These kinds of computational models of brain function, built using deep neural networks, have been designed and refined for other areas of the brain, including the dorsal stream (a separate visual pathway that processes information for seeing motion and the positions of things), the auditory cortex, and even the olfactory pathways.

虽然这些定制模型针对的是大脑中的特定系统,但 LLM 却打破了这种模式。这些更通用的机器让认知科学家能够提出有关人类认知的高级问题,而不仅仅是与视觉等特定任务有关的问题。例如,很明显 LLM 开始显示出心理理论的迹象(即使它只是复杂的模式匹配,即使 LLM 不可否认有时会出错)。它们能帮助我们理解人类认知的这一方面吗?不完全是。至少现在还不行。但认知科学家,即使他们不相信 LLM 在这一领域的实力,仍然很感兴趣。

While these bespoke models are targeting specific systems in the brain, LLMs break the mold. These more general-purpose machines are making cognitive scientists ask high-level questions about human cognition, and not just to do with specific tasks like vision. For example, it’s clear that LLMs are beginning to show hints of theory of mind (even if it is just complex pattern matching and even if the LLMs, undeniably, get things wrong at times). Can they help us understand this aspect of human cognition? Not quite. At least, not yet. But cognitive scientists, even if they aren’t convinced of an LLM’s prowess in this arena, are nonetheless intrigued.

LLM 已经在人类认知的其他领域(如语言习得)引起了认知科学家和语言学家的不安。认知科学领域一直存在一场争论:人类语言的某些方面(如语法和语义)是依赖于先天能力,还是可以通过接触语言来习得。(回想一下第 11 章中乔姆斯基和皮亚杰之间的争论。)LLM 清楚地表明,后者在某种程度上是正确的,但有一个前提:LLM 是在相当于整个互联网的数据上训练的;没有哪个孩子在学习过程中能接触到哪怕接近这么多的语言。尽管如此,LLM 可以从人类书面语言中存在的统计模式中学会句法和语法,而且它们具有某种语义概念。对 LLM 来说,“重”这个词的含义可能与它对人类的含义不太一样,但 LLM 能够以至少暗示了对这些词的某种语义理解的方式来“推理”轻与重。这完全取决于你把“理解某事”的标准定在哪里。LLM 能轻松跨过某些标准,却在另一些标准上惨败。

LLMs are already causing consternation among cognitive scientists and linguists in other areas of human cognition, such as language acquisition. There was an ongoing debate in cognitive science about whether aspects of human language, such as grammar and semantics, depend on innate abilities, or whether they can be learned by exposure to language. (Recall the debate between Chomsky and Piaget from chapter 11.) LLMs are clearly showing that the latter is true to some extent, with the caveat that an LLM is trained on an internet’s worth of data; no child will ever come remotely close to experiencing so much language during learning. Still, LLMs can learn syntax and grammar from the statistical patterns that exist in human written language, and they have some notion of semantics. To an LLM, the word “heavy” might not mean quite the same thing as it might to humans, yet the LLM is able to “reason” about heaviness and lightness in ways that suggest at least some semantic understanding of those words. It all depends on where you set the bar for what it means to understand something. LLMs clear some bars with ease and fail miserably at others.
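A bigram counter is about the smallest possible illustration of the claim that word-order statistics alone encode a sliver of syntax. It is a cartoon of what LLMs do at incomprehensibly larger scale, not a description of how they work, and the tiny corpus here is made up.

```python
from collections import Counter, defaultdict

corpus = ("the nurse married the doctor . the doctor thanked the nurse . "
          "the nurse was kind .").split()

# Count which word follows which: the entire "training" procedure.
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

# The model has never been told what a noun is, yet after "the" it
# predicts nouns, never verbs: the statistics alone encode some word order.
print(bigrams["the"].most_common(1)[0][0])   # "nurse"
```

Scaling this idea up — from pairs of words to long contexts, and from counting to learned representations — is, loosely speaking, the road that leads to the statistical pattern-learning of LLMs.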

尽管这些进展令人兴奋,但我们对深度神经网络和生物大脑之间的所有对应关系都应持保留态度。现在还处于早期阶段。深度网络和大脑在结构和性能上的趋同并不一定意味着两者的工作方式相同;它们在某些方面显然不同。例如,生物神经元“产生尖峰”,这意味着信号以电压尖峰的形式沿着轴突传播。人工神经元不会产生尖峰,至少那些广泛使用的人工神经元不会。此外,在能源效率方面,生物大脑和深度神经网络之间存在巨大差异。虽然像 OpenAI 和谷歌这样的公司并没有特别公开在进行推理时运行 LLM 的能源成本,但致力于开源模型研究的公司 Hugging Face 计算出,其一个模型(名为 BLOOM 的 1750 亿参数网络)在 18 天的时间内,平均消耗约 1,664 瓦。相比之下,尽管我们的大脑有大约 860 亿个神经元和大约 100 万亿个连接或参数,但消耗的功率仅为 20 到 50 瓦。从表面上看,这真的没有可比性。(但在人工智能发展的这个阶段,我们也在比较苹果和橘子:大脑在某些方面的能力要强得多,但 LLM 在某些任务(例如编码)上的速度要快得多,并且可以做任何单个生物大脑都无法做到的事情。)

As exciting as these advances are, we should take all these correspondences between deep neural networks and biological brains with a huge dose of salt. These are early days. The convergences in structure and performance between deep nets and brains do not necessarily mean the two work in the same way; there are ways in which they demonstrably do not. For example, biological neurons “spike,” meaning the signals travel along axons as voltage spikes. Artificial neurons don’t spike, at least not those that are widely used. Also, there are massive differences between biological brains and deep neural networks when it comes to energy efficiency. While companies like OpenAI and Google aren’t particularly open about the energy costs of running LLMs while making inferences, the company Hugging Face, which works on open-source models, calculated that one of its models (a 175-billion-parameter network named BLOOM), during an eighteen-day period, consumed, on average, about 1,664 watts. Compare that to the 20 to 50 watts our brains use, despite our having about 86 billion neurons and about 100 trillion connections, or parameters. On the face of it, it’s no comparison, really. (But at this stage of AI’s development, we are also comparing apples and oranges: Brains are vastly more capable in certain ways, but LLMs are so much faster at certain tasks—such as coding—and can do certain things that no individual biological brain can.)
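A back-of-the-envelope calculation with the figures quoted above makes the efficiency gap vivid. Both inputs are rough assumptions: the lower-bound 20 watts for the brain, and treating its roughly 100 trillion connections as “parameters.”

```python
# Watts per parameter: BLOOM's measured average draw versus the brain.
bloom_watts, bloom_params = 1664, 175e9     # figures quoted in the text
brain_watts, brain_params = 20, 100e12      # lower-bound wattage, ~connection count

ratio = (bloom_watts / bloom_params) / (brain_watts / brain_params)
print(round(ratio))   # roughly 48,000x more watts per parameter for BLOOM
```

Even granting the apples-and-oranges caveat, the per-parameter gap spans more than four orders of magnitude.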

目前还不清楚,纯粹的机器学习,或者说对数据中模式的学习,是否真的能让我们达到生物大脑和身体所展现的那种智能。我们的大脑是具身的。机器学习人工智能是否也需要同样具身化,才能发展出像人类一样的通用智能,还是说像 LLM 这样无实体的人工智能就能让我们达到目标?同样,人们的看法大相径庭。

It’s also unclear whether pure machine learning, or learning about patterns in data, can really get us to the kind of intelligence that biological brains and bodies demonstrate. Our brains are embodied. Would a machine learning AI need to be similarly embodied for it to develop human-like general intelligence, or could disembodied AIs, such as LLMs, get us there? Again, opinions differ, and starkly so.

但也许,人工与自然这条分界线两侧的系统之间有足够多的相似之处,表明这两种智能背后存在着相同的支配原则。同样优雅的数学也许支撑着它们两者。

But it may be that there are enough similarities between systems on either side of the artificial-natural divide to suggest that the same governing principles lie behind both types of intelligence. The same elegant math might underpin them both.

致谢

Acknowledgments

这本书的诞生要归功于我在 2019-20 年担任麻省理工学院 Knight 科学新闻研究员期间的经历。在我的研究员项目中,我选择编写一个简单的深度神经网络,只要有足够的关于先前行星轨道的数据,它就可以预测行星的未来位置(对于机器学习从业者来说很简单;对我来说,神经网络是一种我几乎无法理解的神秘而神奇的东西)。这个项目意味着我要回归软件工程师的本源,二十年前我放弃了这个职业,成为一名科学记者和作家。我参加了麻省理工学院的 Python 编程入门课程,与青少年一起上课,然后又学习了一门关于人工智能基础的课程。那是一段紧张的学习时光,我非常感谢 KSJ 所有人的支持,特别是 Deborah Blum、Ashley Smart、Bettina Urcuioli 和我出色的研究员团队。从技术角度来说,我之所以能够完成“开普勒机器中的幽灵”项目,完全得益于与当时哈佛大学博士生、现任苹果研究科学家 Preetum Nakkiran 进行的多次有益的讨论。

This book owes its genesis to my time as a Knight Science Journalism Fellow at MIT in 2019–20. For my fellowship project, I chose to code a simple deep neural network that could predict the future positions of planets, given enough data about prior planetary orbits (simple for machine learning practitioners; for me, a neural network was some mysterious, magical thing that I could barely grasp). The project meant going back to my roots as a software engineer, a career I gave up two decades ago to become a science journalist and author. I attended MIT’s introductory Python programming course, sharing the class with teenagers, followed by a course on the basics of AI. It was a time of intense learning, and I gratefully acknowledge the support of everyone at KSJ, particularly Deborah Blum, Ashley Smart, Bettina Urcuioli, and my amazing cohort of fellows. On the technical front, I could only pull off the project—Kepler’s Ghost in the Machine—because of the many helpful discussions with Preetum Nakkiran, then a Ph.D. student at Harvard and now a research scientist at Apple.

新冠疫情侵蚀了我们研究员任期的最后几个月。我被困在公寓里,先是在马萨诸塞州的剑桥,然后是在加利福尼亚州的伯克利,我渴望更深入地理解机器学习 (ML)。事实证明,互联网是一个资料宝库,我满怀感激地钻进了这个发现的兔子洞。我要感谢所有人——无论是专业人士还是业余爱好者——他们的讲座和内容为本书提供了信息和启发,特别是:Kilian Weinberger、Gilbert Strang、Patrick Winston、Anand Avati、Bing Brunton、Steve Brunton、Nathan Kutz、Rich Radke 和 Andrew Ng。

Covid cut into the final months of our fellowship. Confined to my apartment, first in Cambridge, Massachusetts, and then in Berkeley, California, I hungered for a deeper understanding of machine learning (ML). The internet proved to be a treasure trove of material and I gratefully went down that rabbit hole of discovery. I’d like to thank everyone—professionals and amateurs alike—whose lectures and content have informed and inspired this book, in particular: Kilian Weinberger, Gilbert Strang, Patrick Winston, Anand Avati, Bing Brunton, Steve Brunton, Nathan Kutz, Rich Radke, and Andrew Ng.

所有这些学习让我想要尽我所能地分享机器学习的数学之美。感谢我在 Dutton 的编辑 Stephen Morrow,感谢他对这本不回避方程式的书的热情和全心全意的支持。感谢 Stephen 的副编辑 Grace Layer 将本书一路推进到完成。我还要感谢 Dutton 的团队,特别是高级制作编辑 LeeAnn Pemberton 和文字编辑 Jenna Dolan,感谢他们的耐心和专业精神。感谢我的经纪人、Curious Minds 文学代理公司的 Peter Tallack,一如既往地给予我帮助。还要感谢 Alfred P. Sloan 基金会:他们的资助对我的研究和写作帮助很大(并向审阅我的资助提案的匿名研究人员致以敬意——他们的评论在我开始写这本书时非常有用)。

All this learning made me want to share the beauty of the math of ML as best as I could. My thanks to Stephen Morrow, my editor at Dutton, for sharing in the enthusiasm for, and wholeheartedly supporting, a book that did not eschew equations. My thanks to Stephen’s associate editor, Grace Layer, for seeing the book through to the end. I’m also grateful to the team at Dutton, in particular senior production editor LeeAnn Pemberton and copy editor Jenna Dolan, for their patience and professionalism. Thanks to my agent, Peter Tallack, of the Curious Minds literary agency, for his help, as always. And many thanks to the Alfred P. Sloan Foundation: their grant helped greatly with the research and writing (and a tip of the hat to the anonymous researcher who reviewed my grant proposal—their comments were immensely useful as I began working on the book).

德国海德堡理论研究所为我提供了为期六个月的新闻驻留机会,让我可以在他们出色的研究所和美丽的城市里工作,并有时间和资源来编写本书中机器学习算法背后的代码。非常感谢 HITS 的友好人士,感谢他们的友谊以及许多关于机器学习和人工智能的令人振奋的对话。

The Heidelberg Institute for Theoretical Studies in Germany gave me a six-month journalism residency at their wonderful institute and in their lovely city, and the time and resources to work on the code underlying the ML algorithms in this book. A huge thanks to the lovely folks at HITS, for their friendship and for many invigorating conversations about machine learning and AI.

接下来,我要感谢与我讨论(和/或通过电子邮件交流)他们的工作和历史事件的研究人员(许多人阅读了相关章节以检查错误)。他们是(按在书中出现的顺序):Ilya Sutskever、George Nagy(曾师从已故的 Frank Rosenblatt)、Bernard Widrow、Philip Stark、Patrick Juola、Marcello Pelillo、Peter Hart、Emery Brown、John Abel、Bernhard Boser、Isabelle Guyon、Manfred K. Warmuth、David Haussler、John Hopfield、George Cybenko、Geoffrey Hinton、Yann LeCun、Mikhail Belkin、Alethea Power、Peter Bartlett 和 Alexei Efros。另外,感谢 Demis Hassabis 的启发性对话。

Next, I want to thank the researchers who spoke to me (and/or interacted via email) about their work and historical events (many read the relevant chapters to check for errors). They are (in the order of their appearance in the book): Ilya Sutskever, George Nagy (who studied with the late Frank Rosenblatt), Bernard Widrow, Philip Stark, Patrick Juola, Marcello Pelillo, Peter Hart, Emery Brown, John Abel, Bernhard Boser, Isabelle Guyon, Manfred K. Warmuth, David Haussler, John Hopfield, George Cybenko, Geoffrey Hinton, Yann LeCun, Mikhail Belkin, Alethea Power, Peter Bartlett, and Alexei Efros. Also, thanks to Demis Hassabis for an inspiring conversation.

我还要感谢 Rao Akella、François Chollet、Tim Kietzmann、Dmitry Krotov、Grace Lindsay、Krishna Pant、Sriram Srinivasan 和 Sonali Tamhankar,他们阅读了本书的部分内容,纠正了错误,并提供了深刻的建议和鼓励。

I’m also grateful to Rao Akella, François Chollet, Tim Kietzmann, Dmitry Krotov, Grace Lindsay, Krishna Pant, Sriram Srinivasan, and Sonali Tamhankar, for reading parts of the book, correcting errors, and providing insightful suggestions and encouragement.

最重要的是,我要向加州大学圣地亚哥分校的 Mikhail (Misha) Belkin 和印度班加罗尔国家生物科学中心的 Shachi Gosavi 表示最深切的谢意。他们每个人都慷慨地阅读了整本书,指出了错误并提供了详细的反馈。他们的科学和数学专业知识以及敏锐的洞察力使我避免了许多令人尴尬的错误。当然,任何遗留的错误,无论是令人尴尬的、严重的还是两者兼而有之的,都完全由我负责。

Most important, my deepest gratitude to Mikhail (Misha) Belkin, of the University of California, San Diego, and Shachi Gosavi, of the National Centre for Biological Sciences, Bengaluru, India. Each generously read the entire book, flagged errors, and provided detailed feedback. Their scientific and mathematical expertise and eagle eyes saved me from many an embarrassing mistake. Of course, any errors that remain, embarrassing, egregious, or both, are solely my responsibility.

最后,一如既往地感谢朋友和家人,特别是我的母亲,她以或大或小的种种方式让这项旷日持久的工作成为可能;也感谢我的父亲,尽管当时他的健康状况已急剧恶化,他仍鼓励我去麻省理工学院待上一年。谨以此书献给他,纪念他。

Finally, as always, thanks to friends and family, especially my mother, who made such a prolonged effort possible in ways big and small; and to my father, who encouraged me to spend a year away at MIT, even though his health was failing badly. This book is for him, in his memory.

笔记

Notes

有关正在运行的勘误表,请参阅:www.anilananthaswamy.com/whymachineslearn-errata

For a running errata, see: www.anilananthaswamy.com/whymachineslearn-errata

序幕

PROLOGUE

  1. 埋在第 25 页: “新海军设备通过实践学习”,《纽约时报》,1958 年 7 月 8 日,第 25 页,www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html。

  2. Buried on page 25: “New Navy Device Learns by Doing,” New York Times, July 8, 1958, p. 25, www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html.

  3. 康奈尔大学心理学家罗森布拉特 (Rosenblatt): Melanie Lefkowitz,《教授的感知器为人工智能铺平了道路——早了 60 年》,《康奈尔纪事报》,2019 年 9 月 25 日,news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon。

  4. Rosenblatt, a Cornell University psychologist: Melanie Lefkowitz, “Professor’s Perceptron Paved the Way for AI—60 Years Too Soon,” Cornell Chronicle, September 25, 2019, news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon.

  5. 美国海军研究办公室资助的论文: “智能自动机的设计”,《研究趋势》 VI,第 2 期(1958 年夏季)。

  6. funding from the U.S. Office of Naval Research: “The Design of an Intelligent Automaton,” Research Trends VI, No. 2 (Summer 1958).

  7. “Rosenblatt 博士说他可以解释机器为何能够学习”: “海军新设备通过实践进行学习”,第 25 页。

  8. “Dr. Rosenblatt said he could explain why the machine learned”: “New Navy Device Learns by Doing,” p. 25.

  9. 托马斯·贝叶斯,十八世纪英国统计学家和牧师: “托马斯·贝叶斯”,Quick Info,MacTutor,无日期,mathshistory.st-andrews.ac.uk/Biographies/Bayes/。

  10. Thomas Bayes, the eighteenth-century English statistician and minister: “Thomas Bayes,” Quick Info, MacTutor, n.d., mathshistory.st-andrews.ac.uk/Biographies/Bayes/.

  11. 德国数学家卡尔·弗里德里希·高斯: “正态分布”,《科学与技术》,大英百科全书,无日期,www.britannica.com/topic/normal-distribution。

  12. German mathematician Carl Friedrich Gauss: “Normal distribution,” Science & Tech, Britannica, n.d., www.britannica.com/topic/normal-distribution.

  13. 关于这一数学分支的最早阐述: Roger Hart,《线性代数的中国根》(巴尔的摩,马里兰州:约翰霍普金斯大学出版社,2011年),第7页。

  14. The earliest exposition of this branch of mathematics: Roger Hart, The Chinese Roots of Linear Algebra (Baltimore, Md.: Johns Hopkins University Press, 2011), p. 7.

  15. 看起来我们正在……阅读: Eugenia Cheng,《数学是真实的吗?》(纽约:Basic Books,2023 年),第 9 页。

  16. It can…seem like we’re taking: Eugenia Cheng, Is Math Real? (New York: Basic Books, 2023), p. 9.

  17. “怎么会这样”: 2021 年 12 月 8 日对 Ilya Sutskever 的 Zoom 采访。Sutskever 的这段话和所有后续引言均出自本次作者采访。

  18. “How can it be”: Zoom interview with Ilya Sutskever, December 8, 2021. This and all subsequent quotes by Sutskever are from this author interview.

第 1 章:拼命寻找模式

CHAPTER 1: DESPERATELY SEEKING PATTERNS

  1. 孩童时期: 康拉德·洛伦兹(Konrad Lorenz),传记,诺贝尔奖官网,无日期,www.nobelprize.org/prizes/medicine/1973/lorenz/biographical/。

  2. When he was a child: Konrad Lorenz, Biographical, The Nobel Prize, n.d., www.nobelprize.org/prizes/medicine/1973/lorenz/biographical/.

  3. 他让小鸭子在他身上留下印记: “康拉德·洛伦兹——印记”,YouTube,无日期,www.youtube.com/watch?v=6-HppwUsMGY。

  4. He got ducklings to imprint on him: “Konrad Lorenz—Imprinting,” YouTube, n.d., www.youtube.com/watch?v=6-HppwUsMGY.

  5. “因其发现”: “1973 年诺贝尔生理学或医学奖”,诺贝尔奖官网,无日期,www.nobelprize.org/prizes/medicine/1973/summary/。

  6. “for their discoveries”: “The Nobel Prize in Physiology or Medicine 1973,” The Nobel Prize, n.d., www.nobelprize.org/prizes/medicine/1973/summary/.

  7. 它们会对关系概念留下印记: Antone Martinho III 和 Alex Kacelnik,“小鸭子会对‘相同或不同’关系概念留下印记” , 《科学》 353,第 6296 期(2016 年 7 月):286–88。

  8. they imprint on the relational concept: Antone Martinho III and Alex Kacelnik, “Ducklings Imprint on the Relational Concept of ‘Same or Different,’ ” Science 353, No. 6296 (July 2016): 286–88.

  9. 如果小鸭子出生时看到: Anil Ananthaswamy,《人工智能的下一个大飞跃》,《Knowable Magazine》,2020 年 10 月 14 日,https://knowablemagazine.org/content/article/technology/2020/what-is-neurosymbolic-ai。

  10. if upon birth the ducklings see: Anil Ananthaswamy, “AI’s Next Big Leap,” Knowable Magazine, October 14, 2020, https://knowablemagazine.org/content/article/technology/2020/what-is-neurosymbolic-ai.

  11. 沃伦·麦卡洛克(Warren McCulloch): WS McCulloch,《什么是数字,一个人可以知道它?什么是人,他可以知道数字?》,《通用语义学公报》,第 26/27 期(1960 年):7-18。

  12. Warren McCulloch: W. S. McCulloch, “What Is a Number, that a Man May Know It, and a Man, that He May Know a Number?,” General Semantics Bulletin, No. 26/27 (1960): 7–18.

  13. “大脑的逻辑”: MA Arbib,《沃伦·麦卡洛克对神经系统逻辑的探索》,《生物学和医学观点》 43,第2期(2000 年冬季):193-216。

  14. “logic of the brain”: M. A. Arbib, “Warren McCulloch’s Search for the Logic of the Nervous System,” Perspectives in Biology and Medicine 43, No. 2 (Winter 2000): 193–216.

  15. 断言是所有计算: Arbib,“沃伦·麦卡洛克对神经系统逻辑的探索。”

  16. The assertion was that all computation: Arbib, “Warren McCulloch’s Search for the Logic of the Nervous System.”

  17. “著名数理逻辑学家鲁道夫·卡尔纳普的门生”: Arbib,《沃伦·麦卡洛克对神经系统逻辑的探索》。

  18. “a protégé of the eminent mathematical logician Rudolf Carnap”: Arbib, “Warren McCulloch’s Search for the Logic of the Nervous System.”

  19. 由乌克兰数学物理学家 Nicolas Rashevsky 经营: Neil Smalheiser,“Walter Pitts”,生物学和医学观点43,第 2 期(2000 年 2 月):217-26。

  20. run by Ukrainian mathematical physicist Nicolas Rashevsky: Neil Smalheiser, “Walter Pitts,” Perspectives in Biology and Medicine 43, No. 2 (February 2000): 217–26.

  21. “混乱的青少年”: Arbib,《沃伦·麦卡洛克对神经系统逻辑的探索》。

  22. “mixed-up adolescent”: Arbib, “Warren McCulloch’s Search for the Logic of the Nervous System.”

  23. “接下来是无数个夜晚”: Arbib,《沃伦·麦卡洛克对神经系统逻辑的探索》。

  24. “There followed endless evenings”: Arbib, “Warren McCulloch’s Search for the Logic of the Nervous System.”

  25. 在该作品中: Warren S. McCulloch 和 Walter Pitts,《神经活动中内在观念的逻辑演算》,《数学生物物理学报》 5(1943 年 12 月):115-33。

  26. In that work: Warren S. McCulloch and Walter Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” Bulletin of Mathematical Biophysics 5 (December 1943): 115–33.

  27. 一个简单的计算模型: John Bullinaria,“生物神经元和神经网络、人工神经元”,神经计算,讲座 2,PDF,https://www.cs.bham.ac.uk/~jxb/NN/l2.pdf

  28. a simple computational model: John Bullinaria, “Biological Neurons and Neural Networks, Artificial Neurons,” Neural Computation, Lecture 2, PDF, https://www.cs.bham.ac.uk/~jxb/NN/l2.pdf.

  29. “这很困难”: 2020 年 9 月 22 日对乔治·纳吉的电话采访。纳吉的这段引言以及随后的所有引言均出自本次作者采访。

  30. “It was difficult”: Telephone interview with George Nagy, September 22, 2020. This and all subsequent quotes by Nagy are from this author interview.

  31. 文章标题为:弗兰克·罗森布拉特(Frank Rosenblatt),《智能自动机的设计》,《研究趋势》 VI,第2期(1958 年夏季):1-7。

  32. The article was titled: Frank Rosenblatt, “The Design of an Intelligent Automaton,” Research Trends VI, No. 2 (Summer 1958): 1–7.

  33. Hebb 曾提出一种机制: Christian Keysers 和 Valeria Gazzola,“Hebbian 学习和动作、感觉和情绪的预测镜像神经元”,《皇家学会哲学学报 B》, 369,第 1644 期(2014 年 6 月):20130175。

  34. Hebb had proposed a mechanism: Christian Keysers and Valeria Gazzola, “Hebbian Learning and Predictive Mirror Neurons for Actions, Sensations and Emotions,” Philosophical Transactions of the Royal Society B, 369, No. 1644 (June 2014): 20130175.

  35. 该过程称为赫布学习: Simon Haykin,《神经网络和学习机器》,第 3 版(纽约:Pearson Prentice Hall,2009 年),第 368 页。

  36. The process is called Hebbian learning: Simon Haykin, Neural Networks and Learning Machines, 3rd ed. (New York: Pearson Prentice Hall, 2009), p. 368.

  37. 此次合作卓有成效: HD Block、BW Knight, Jr. 和 F. Rosenblatt,《四层串联耦合感知器的分析。II》,《现代物理评论》第 34 卷,第 135 期(1962 年 1 月):135–42。

  38. The collaboration proved fruitful: H. D. Block, B. W. Knight, Jr., and F. Rosenblatt, “Analysis of a Four-Layer Series-Coupled Perceptron. II,” Reviews of Modern Physics 34, No. 135 (January 1962): 135–42.

  39. “重点是我从马克身上学到了一些东西”:乔治·纳吉(George Nagy),《弗兰克·罗森布拉特,我的杰出顾问》,PDF(2011 年 5 月),第 13 页,https://sites.ecse.rpi.edu/~nagy/PDF_chrono/2011_Nagy_Pace_FR.pdf

  40. “The point is that Mark I learned”: George Nagy, “Frank Rosenblatt, My Distinguished Advisor,” PDF (May 2011), p. 13, https://sites.ecse.rpi.edu/~nagy/PDF_chrono/2011_Nagy_Pace_FR.pdf.

  41. 感知器是一种增强型: Simon Haykin,《神经网络和学习机器》,第 48 页。

  42. a perceptron is an augmented: Simon Haykin, Neural Networks and Learning Machines, p. 48.

第 2 章:我们在这里都只是数字……

CHAPTER 2: WE ARE ALL JUST NUMBERS HERE…

  1. 写了一封分为四段的信: Amy Buchmann,《四元数和四元数变量全纯函数理论简史》,arXiv,2011 年 11 月 25 日,https://arxiv.org/abs/1111.6088。

  2. wrote a letter in four paragraphs: Amy Buchmann, “A Brief History of Quaternions and of the Theory of Holomorphic Functions of Quaternionic Variables,” arXiv, November 25, 2011, https://arxiv.org/abs/1111.6088.

  3. “电路似乎闭合了”: Buchmann,《四元数和四元数变量全纯函数理论简史》,第 10 页。

  4. “An electric circuit seemed to close”: Buchmann, “A Brief History of Quaternions and of the Theory of Holomorphic Functions of Quaternionic Variables,” p. 10.

  5. “通过这个四元数的段落”: Buchmann,《四元数和四元数变量全纯函数理论简史》,第 10 页。

  6. “With this quaternion of paragraphs”: Buchmann, “A Brief History of Quaternions and of the Theory of Holomorphic Functions of Quaternionic Variables,” p. 10.

  7. “他走过这里”: “布鲁姆桥,皇家运河,布鲁姆布里奇路,巴利博根南,都柏林 7 区,都柏林”,国家建筑遗产清单,www.buildingsofireland.ie/buildings-search/building/50060126/broome-bridge-royal-canal-broombridge-road-ballyboggan-south-dublin-7-dublin-city。

  8. “Here as he walked by”: “Broome Bridge, Royal Canal, Broombridge Road, Ballyboggan South, Dublin 7, Dublin,” National Inventory of Architectural Heritage, www.buildingsofireland.ie/buildings-search/building/50060126/broome-bridge-royal-canal-broombridge-road-ballyboggan-south-dublin-7-dublin-city.

    GO TO NOTE REFERENCE IN TEXT

  5. he introduced the terms “scalar” and “vector”: William Rowan Hamilton, “Theory of Quaternions,” Proceedings of the Royal Irish Academy (1836–1869) 3 (1844–47): 1–16.

  6. “A body by two forces”: Isaac Newton, The Mathematical Principles of Natural Philosophy, English translation by Andrew Motte (New York: Daniel Adee, 1846), p. 84.

  7. “I believe that I have found the way”: Gottfried Wilhelm Leibniz, Philosophical Papers and Letters (Dordrecht: D. Reidel Publishing Company, 1969), p. 249.

  8. “shadow cast”: “Vector Projection Formula,” Geeks for Geeks, n.d., www.geeksforgeeks.org/vector-projection-formula/.
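
The projection formula this note points to — the “shadow” that a vector a casts on a vector b — is short enough to sketch directly (a minimal illustration, assuming numpy):

```python
import numpy as np

def project(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # The "shadow" of a on b: proj_b(a) = (a · b / b · b) * b
    return (np.dot(a, b) / np.dot(b, b)) * b

shadow = project(np.array([3.0, 4.0]), np.array([1.0, 0.0]))
# shadow is [3, 0]: the component of (3, 4) lying along the x-axis
```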

  9. One such proof was developed in 1962: H. D. Block, “The Perceptron: A Model for Brain Functioning. I,” Reviews of Modern Physics 34, No. 1 (January 1962): 123–35.

  10. “the logic of what’s possible”: Office of the Dean of the University Faculty, “Block, Henry David,” Memorial Statement for Professor Henry David Block, eCommons, Cornell University, n.d., ecommons.cornell.edu/handle/1813/18056.

  11. “For all his exceptional intelligence and accomplishments”: Office of the Dean of the University Faculty, “Block, Henry David.”

  12. classic twenty-two-page review of Perceptrons: H. D. Block, “A Review of ‘Perceptrons: An Introduction to Computational Geometry,’ ” Information and Control 17, No. 5 (December 1970): 501–22.

  13. “We will study in great detail a class of computations”: Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry (Cambridge, Mass.: The MIT Press, 1988), p. 1.

  14. “The machines we will study”: Minsky and Papert, Perceptrons, p. 4.

  15. “It is a remarkable book”: Block, “A Review of ‘Perceptrons,’ ” p. 501.

  16. “In an abstract mathematical sense”: Minsky and Papert, Perceptrons, p. 282.

  17. “Cybernetics,” a term coined: Norbert Wiener, Cybernetics, 2nd ed. (Cambridge, Mass.: The MIT Press, 1961), p. 11.

  18. “Since there is nothing in ‘Agmon’s work’ ”: Block, “A Review of ‘Perceptrons,’ ” p. 513.

  19. In sum then, Minsky and Papert’s formulation: Block, “A Review of ‘Perceptrons,’ ” p. 519.

  20. Minsky and Papert’s convergence proof: Simon Haykin, Neural Networks and Learning Machines, p. 50. Also, see: Michael Collins, “Convergence Proof for the Perceptron Algorithm,” PDF, http://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf.

  21. the training algorithm: Shivaram Kalyanakrishnan, “The Perceptron Learning Algorithm and Its Convergence,” January 21, 2017, PDF, https://www.cse.iitb.ac.in/~shivaram/teaching/old/cs344+386-s2017/resources/classnote-1.pdf.
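
The training algorithm in Kalyanakrishnan’s note — the classic perceptron update — fits in a few lines. A minimal sketch (assuming numpy; labels are taken to be −1 or +1, and the bias is folded in as an extra constant feature):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    # Append a constant 1 to each input so the bias is just another weight.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # the perceptron update
                updated = True
        if not updated:              # no mistakes in a full pass: converged
            break
    return w
```

On linearly separable data (for example, the AND function with −1/+1 labels) the loop is guaranteed to stop, which is exactly what the cited convergence proofs establish.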

  22. “There are deep connections”: Simons Institute, “Until the Sun Engulfs the Earth: Lower Bounds in Computational Complexity,” Theory Shorts, YouTube, n.d., www.youtube.com/watch?v=-DWmBhMgWrI.

  23. “Most workers in AI research”: “Part I: Artificial Intelligence: A General Survey by Sir James Lighthill, FRS, Lucasian Professor of Applied Mathematics, Cambridge University, July 1972,” PDF, www.aiai.ed.ac.uk/events/lighthill1973/lighthill.pdf.

  24. recordings of lectures by Kilian Weinberger: Kilian Weinberger, “Lecture 1 ‘Supervised Learning Setup’—Cornell CS4780 Machine Learning for Decision Making SP17,” YouTube video, n.d., youtu.be/MrLPzBxG95I.

CHAPTER 3: THE BOTTOM OF THE BOWL

  1. “I’ve got this student named Ted Hoff”: Zoom interviews with Bernard Widrow on June 19, 2021, and December 10, 2021, and an email exchange on April 26, 2022. This and all subsequent quotes by Widrow are from these author interviews.

  2. The coining of the term “artificial intelligence”: “John McCarthy, Stanford University, 1999 Fellow,” Computer History Museum, n.d., computerhistory.org/profile/john-mccarthy/.

  3. “We propose that a 2 month”: J. McCarthy et al., “A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955,” AI Magazine 27, No. 4 (2006): 12.

  4. At the heart of such an adaptive filter: Bernard Widrow, “Adaptive Filters I: Fundamentals,” Technical Report No. 6764-6, December 1966, Stanford University, PDF, https://isl.stanford.edu/~widrow/papers/t1966adaptivefilters.pdf.

  5. advantages to squaring the errors: Steven J. Miller, “The Method of Least Squares,” PDF, https://web.williams.edu/Mathematics/sjmiller/public_html/probabilitylifesaver/MethodLeastSquares.pdf.

  6. first proposed in 1847: Claude Lemaréchal, “Cauchy and the Gradient Method,” Documenta Mathematica Extra, Vol.: “Optimization Stories” (2012): 251–54, PDF, https://ems.press/books/dms/251/4938.

  7. the method of steepest descent: Juan C. Meza, “Steepest Descent,” Computational Statistics 2, No. 6 (September 24, 2010): 719–22.
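
Steepest descent, as surveyed in pieces like Meza’s, amounts to repeatedly stepping against the gradient until you reach the bottom of the bowl. A toy sketch in plain Python (the example function and learning rate are illustrative choices, not taken from the source):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step downhill, opposite the gradient.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
# x_min converges toward 3, the bottom of the bowl
```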

  8. To understand how to find the derivative of a function: “Derivative,” Wolfram MathWorld, n.d., mathworld.wolfram.com/Derivative.html.

  9. “preliminary terror”: Silvanus Thompson, Calculus Made Easy (London: Macmillan, 1914), p. 1.

  10. Part of what happens during a handshake: “Modem Negotiation,” EECS20N: Signals and Systems, UC Berkeley EECS Dept., n.d., ptolemy.berkeley.edu/eecs20/week14/negotiation.html.

  11. Decades later, Widrow, recalling Wiener’s personality: Bernard Widrow, Cybernetics 2.0: A General Theory of Adaptivity and Homeostasis in the Brain and in the Body (Cham, Switzerland: Springer, 2022), p. 242.

  12. solve the equation: “3. The Wiener Filter,” PDF, ocw.snu.ac.kr/sites/default/files/NOTE/7070.pdf.

  13. With a little bit of analysis: For a detailed derivation of the result, see Bernard Widrow’s own exposition in “The LMS Algorithm and ADALINE: Part I—The LMS Algorithm,” YouTube, n.d., www.youtube.com/watch?v=hc2Zj55j1zU. The description of Widrow’s adaptive filter and mathematical analysis in this chapter is based on my interview with Widrow. Also see: Bernard Widrow, “Adaptive Filters I: Fundamentals.”
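
The LMS update that Widrow derives in the lecture cited here can be sketched in a few lines (a simplified illustration, assuming numpy; the tap count and the step size μ are arbitrary choices, not values from the source):

```python
import numpy as np

def lms_filter(x, d, mu=0.05, taps=4):
    # Widrow-Hoff LMS: nudge the weights along the instantaneous
    # error gradient, w <- w + mu * e * x_window, one sample at a time.
    w = np.zeros(taps)
    for n in range(taps, len(x)):
        window = x[n - taps:n][::-1]   # most recent sample first
        y = w @ window                 # filter output
        e = d[n] - y                   # error against the desired signal
        w += mu * e * window           # the LMS update
    return w
```

Fed an input signal and a desired output, the weights drift toward the filter that minimizes the mean squared error — the bottom of the quadratic bowl.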

  14. uploaded in 2012: Widrow, “The LMS Algorithm and ADALINE: Part I—The LMS Algorithm.”

  15. In a 1963 episode of Science in Action: “Science in Action: Computers that Learn,” California Academy of Sciences, December 19, 1963, californiarevealed.org/islandora/object/cavpp%3A21434.

CHAPTER 4: IN ALL PROBABILITY

  1. The problem, named after the host: Daniel Friedman, “Monty Hall’s Three Doors: Construction and Deconstruction of a Choice Anomaly,” American Economic Review 88, No. 4 (September 1998): 933–46.

  2. “Suppose you are on a game show”: John Tierney, “Behind Monty Hall’s Doors: Puzzle, Debate and Answer?,” New York Times, July 21, 1991, p. 1.

  3. Here’s what vos Savant advised: Marilyn vos Savant, “Ask Marilyn,” Missoulian (Missoula, Montana), September 9, 1990, www.newspapers.com/image/351085716/?clipping_id=87196585.

  4. “Shortly thereafter, Savant received an avalanche”: Anthony Lo Bello, “Ask Marilyn: The Mathematical Controversy in Parade Magazine,” The Mathematical Gazette 75, No. 473 (October 1991): 275–77.

  5. Mathematician Keith Devlin gave another take on it: Keith Devlin, “Monty Hall,” Devlin’s Angle (blog), MAA Online, July 2003, https://web.archive.org/web/20030725103328/http://www.maa.org/devlin/devlin_07_03.html.

  6. In his book Which Door Has the Cadillac?: Andrew Vázsonyi, Which Door Has the Cadillac?: Adventures of a Real-Life Mathematician (Lincoln, Neb.: Writers Club Press, 2002), pp. 4–6.

  7. “arguably the most prolific mathematician”: “Paul Erdős, Hungarian Mathematician,” Britannica, n.d., www.britannica.com/biography/Paul-Erdos.

  8. “He reacted as if”: “Paul Erdős, Hungarian Mathematician.”

  9. Vázsonyi used a computer: “Paul Erdős, Hungarian Mathematician.”

  10. “Erdős objected”: “Paul Erdős, Hungarian Mathematician.”

  11. Data scientist Paul van der Laken: “The Monty Hall Problem: Simulating and Visualizing the Monty Hall Problem in Python & R,” paulvanderlaken.com/2020/04/14/simulating-visualizing-monty-hall-problem-python-r/.
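
The kind of simulation van der Laken describes takes only a few lines of Python (a minimal sketch; door labels and the trial count are arbitrary):

```python
import random

def monty_trial(switch: bool) -> bool:
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that hides a goat and isn't the contestant's pick.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
p_switch = sum(monty_trial(True) for _ in range(trials)) / trials
p_stay = sum(monty_trial(False) for _ in range(trials)) / trials
# p_switch hovers near 2/3, p_stay near 1/3
```

Run it and the frequencies land where vos Savant said they would: switching wins about two times in three.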

  12. “born in 1701 with probability 0.8”: Stephen M. Stigler, “Richard Price, the First Bayesian,” Statistical Science 33, No. 1 (2018): 117–25.

  13. Royal Tunbridge Wells in England: “Thomas Bayes: English Theologian and Mathematician,” Science & Tech, Britannica, n.d., www.britannica.com/biography/Thomas-Bayes.

  14. Bayes and Price were kindred spirits: Stigler, “Richard Price, the First Bayesian,” p. 117.

  15. Price wrote a letter: “LII. An Essay Towards Solving a Problem in the Doctrine of Chances,” PDF, royalsocietypublishing.org/doi/pdf/10.1098/rstl.1763.0053.

  16. Price submitted another paper: “LII. A Demonstration of the Second Rule in the Essay Towards the Solution of a Problem in the Doctrine of Chances…etc.,” PDF, The Royal Society, royalsocietypublishing.org/doi/10.1098/rstl.1764.0050.

  17. we are ready to tackle the Monty Hall problem: Steven Tijms, “Monty Hall and the ‘Leibniz Illusion,’ ” Chance, American Statistical Association, 2022, https://chance.amstat.org/2022/11/monty-hall/; and Christopher D. Long, “A Bayes’ Solution to Monty Hall,” The Angry Statistician (blog), https://angrystatistician.blogspot.com/2012/06/bayes-solution-to-monty-hall.html.
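
The Bayesian argument in these two pieces reduces to one application of Bayes’ rule. Say you pick door 1 and the host opens door 3; the posterior can be computed directly (plain Python, illustrating the standard calculation):

```python
# Prior: the car is equally likely behind doors 1, 2, and 3.
prior = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}

# Likelihood of the evidence "host opens door 3" given each car location.
# The host never opens your door (door 1) and never reveals the car.
likelihood = {1: 1 / 2, 2: 1, 3: 0}

evidence = sum(prior[d] * likelihood[d] for d in prior)
posterior = {d: prior[d] * likelihood[d] / evidence for d in prior}
# posterior[1] = 1/3 (staying), posterior[2] = 2/3 (switching)
```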

  18. we need a crash course: Rich Radke, “Probability Bites,” YouTube videos, n.d., https://www.youtube.com/playlist?list=PLuh62Q4Sv7BXkeKW4J_2WQBlYhKs_k-pj.

  19. “The joke is that theoreticians”: From emails exchanged with Philip Stark between December 12, 2022, and January 2, 2023. This and all subsequent quotes by Stark are from these author interviews.

  20. We know that our body temperature: Ivayla Geneva et al., “Normal Body Temperature: A Systematic Review,” Open Forum Infectious Diseases 6, No. 4 (April 9, 2019), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456186/. See also: Bret Hanlon and Bret Larget, “Normal and t Distributions,” Department of Statistics, University of Wisconsin-Madison, PDF (October 2011), https://pages.stat.wisc.edu/~st571-1/07-normal-2.pdf.

  21. equating them can be problematic: Philip B. Stark, “Pay No Attention to the Model Behind the Curtain,” Pure and Applied Geophysics 179 (2022): 4121–45.

  22. eloquently told his students: Kilian Weinberger, “Lecture 7, ‘Estimating Probabilities from Data: Maximum Likelihood Estimation’—Cornell CS4780 SP17,” YouTube, n.d., www.youtube.com/watch?v=RIawrYLVdIw.

  23. Months after the U.S. Constitution: Jessie Kratz, “Drafting the U.S. Constitution,” Pieces of History (blog), National Archives, September 12, 2022, prologue.blogs.archives.gov/2022/09/12/drafting-the-u-s-constitution/.

  24. a series of essays, published anonymously: Frederick Mosteller, The Pleasures of Statistics: The Autobiography of Frederick Mosteller (New York: Springer, 2010), p. 48.

  25. The Federalist: A Collection of Essays: “The Federalist: A Collection of Essays, Written in Favour of the New Constitution, as Agreed upon by the Federal Convention, September 17, 1787: In Two Volumes,” Library of Congress, gallery, www.loc.gov/resource/rbc0001.2014jeff21562v1/?st=gallery.

  26. Eventually, it became known that the essays: “About the Authors,” Federalist Essays in Historic Newspapers, Library of Congress, n.d., guides.loc.gov/federalist-essays-in-historic-newspapers/authors.

  27. But as Frederick Mosteller writes: Mosteller, The Pleasures of Statistics, p. 48.

  28. “When we assembled the results”: Mosteller, The Pleasures of Statistics, p. 49.

  29. “[Adair]…was stimulated to write”: Mosteller, The Pleasures of Statistics, p. 50.

  30. “We were spurred to action”: Mosteller, The Pleasures of Statistics, p. 53.

  31. “That was in 1959–60”: Mosteller, The Pleasures of Statistics, p. 54.

  32. “The program did this beautifully”: Mosteller, The Pleasures of Statistics, p. 54.

  33. “The more widely the distributions”: Mosteller, The Pleasures of Statistics, p. 57.

  34. “By whatever methods are used”: Mosteller, The Pleasures of Statistics, p. 58.

  35. “It was very influential in statistical theory”: Zoom interview with Patrick Juola on October 22, 2021. This and all subsequent quotes by Juola are from this author interview.

  36. a team led by marine biologist Kristen Gorman: Kristen B. Gorman, Tony D. Williams, and William R. Fraser, “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis),” PLOS ONE 9, No. 3 (March 2014): e90081.

  37. the year it was studied: A. M. Horst, A. P. Hill, and K. B. Gorman, “Palmerpenguins,” Palmer Archipelago (Antarctica) penguin data, R package version 0.1.0, 2020, allisonhorst.github.io/palmerpenguins/.

CHAPTER 5: BIRDS OF A FEATHER

  1. “No street in the Cholera area was without death”: “Report on the Cholera Outbreak in the Parish of St. James, Westminster, During the Autumn of 1854, Presented to the Vestry by the Cholera Inquiry Committee, July 1855,” p. 18, Wellcome Collection, n.d., wellcomecollection.org/works/z8xczc2r.

  2. “In Broad Street”: “Report on the Cholera Outbreak in the Parish of St. James,” pp. 18–19.

  3. a physician named John Snow: Michael A. E. Ramsay, “John Snow, MD: Anaesthetist to the Queen of England and Pioneer Epidemiologist,” Baylor University Medical Center Proceedings 19, No. 1 (January 2006): 24–28.

  4. “chloroform to Queen Victoria”: Ramsay, “John Snow, MD,” p. 4.

  5. “the various points which have been found”: “Report on the Cholera Outbreak in the Parish of St. James,” p. 109.

  6. “It will be observed that the deaths”: “Report on the Cholera Outbreak in the Parish of St. James,” p. 109.

  7. Voronoi cell: David Austin, “Voronoi Diagrams and a Day at the Beach,” American Mathematical Society Feature Column: Journeys for the Mathematically Curious, August 2006, www.ams.org/publicoutreach/feature-column/fcarc-voronoi.

  8. the most influential algorithms in machine learning: George Chen and Devavrat Shah, “Explaining the Success of Nearest Neighbor Methods in Prediction,” Foundations and Trends in Machine Learning 10, No. 5–6 (January 2018): 337–588.

  9. It was Alhazen who: Abdelghani Tbakhi and Samir S. Amr, “Ibn Al-Haytham: Father of Modern Optics,” Annals of Saudi Medicine 27, No. 6 (November–December 2007): 464–67.

  10. “the most significant figure”: David C. Lindberg, Theories of Vision from Al-Kindi to Kepler (Chicago, Ill.: University of Chicago Press, 1981), p. 58.

  11. “Material replicas issue in all directions”: Lindberg, Theories of Vision from Al-Kindi to Kepler, p. 3.

  12. “The essential feature of this theory”: Lindberg, Theories of Vision from Al-Kindi to Kepler, p. 58.

  13. “transforming the intromission theory”: Lindberg, Theories of Vision from Al-Kindi to Kepler, p. 78.

  14. “When sight perceives some visible object”: A. Mark Smith, Alhacen’s Theory of Visual Perception (Philadelphia, Pa.: American Philosophical Society, 2001), p. 519.

  15. “If it does not find a form”: Smith, Alhacen’s Theory of Visual Perception, p. 519.

  16. “a surprisingly clear, almost algorithmic, exposition”: Marcello Pelillo, “Alhazen and the Nearest Neighbor Rule,” Pattern Recognition Letters 38 (March 1, 2014): 34–37.

  17. “I don’t know whether”: Zoom interview with Marcello Pelillo on June 16, 2021. This and all subsequent quotes by Pelillo are from this author interview.

  18. “That’s the nearest neighbor rule”: Zoom interview with Peter Hart on June 9, 2021. This and all subsequent quotes by Hart are from this author interview.
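
The rule Hart is describing — label a new point with the class of its single closest labeled neighbor — is this short in code (a minimal sketch, assuming numpy; the toy points and labels are invented for illustration):

```python
import numpy as np

def nearest_neighbor(X_train, y_train, x_new):
    # Classify x_new by the label of its closest training point (1-NN).
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array(["red", "red", "blue", "blue"])
label = nearest_neighbor(X, y, np.array([4.5, 5.2]))
# label is "blue": the query point sits in the upper-right cluster
```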

  19. In 1940, Fix came to work: Jerzy Neyman et al., “Evelyn Fix, Statistics: Berkeley, 1904–1965,” About, Berkeley Statistics, statistics.berkeley.edu/about/biographies/evelyn-fix.

  20. “The war years were hard”: Neyman et al., “Evelyn Fix, Statistics: Berkeley, 1904–1965.”

  21. the technical report of 1951: Evelyn Fix and J. L. Hodges, Jr., “Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties,” International Statistical Review 57, No. 3 (December 1989): 238–47.

  22. Even fruit flies are thought to use: Sanjoy Dasgupta, Charles F. Stevens, and Saket Navlakha, “A Neural Algorithm for a Fundamental Computing Problem,” Science 358, No. 6364 (November 10, 2017): 793–96.

  23. “the curse of dimensionality”: Richard Bellman, Dynamic Programming (Princeton, N.J.: Princeton University Press, 1972), p. ix.

  24. “In high dimensional spaces, nobody can hear you scream”: Julie Delon, The Curse of Dimensionality, PDF, mathematical-coffees.github.io/slides/mc08-delon.pdf.

  25. In his lectures: Thomas Strohmer, “Mathematical Algorithms for Artificial Intelligence and Big Data Analysis,” PDF (Spring 2017), www.math.ucdavis.edu/~strohmer/courses/180BigData/180lecture1.pdf.

  26. The volume is given by this formula: For the volume of a hypersphere, see “Hypersphere,” Wolfram MathWorld, n.d., mathworld.wolfram.com/Hypersphere.html; and for the Gamma function, see “Gamma Function,” Wolfram MathWorld, n.d., mathworld.wolfram.com/GammaFunction.html.
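
The formula in question — the volume of a unit n-ball, V_n = π^(n/2) / Γ(n/2 + 1) — can be evaluated with nothing but the standard library, and it shows the curse at work:

```python
import math

def unit_ball_volume(n: int) -> float:
    # V_n = pi^(n/2) / Gamma(n/2 + 1)
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

# The volume peaks near n = 5 and then collapses toward zero:
# in high dimensions, almost none of a cube lies inside its inscribed sphere.
volumes = [unit_ball_volume(n) for n in (1, 2, 3, 5, 10, 20)]
```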

  27. This problem is elegantly analyzed: Alon Amit, Quora, n.d., https://www.quora.com/Why-is-the-higher-the-dimension-the-less-the-hypervolume-of-a-hypersphere-inscribed-in-a-hypercube-occupy-the-hypervolume-of-the-hypercube.

  28. number of atoms in the observable universe: Harry Baker, “How Many Atoms Are in the Observable Universe?,” News, LiveScience, July 10, 2021, www.livescience.com/how-many-atoms-in-universe.html.

  29. “Since this is a curse”: Bellman, Dynamic Programming, p. ix.

CHAPTER 6: THERE’S MAGIC IN THEM MATRICES

  1. “Now, watch this”: Zoom interview with Emery Brown on February 3, 2022. This and all subsequent quotes by Brown are from this author interview.

  2. In one study done by Brown’s team: John H. Abel et al., “Constructing a Control-Ready Model of EEG Signal During General Anesthesia in Humans,” IFAC-PapersOnLine 53, No. 2 (2020): 15870–76.

  3. principal component analysis: For a deep dive into the subject, see YouTube videos on PCA by Steve Brunton, https://www.youtube.com/watch?v=fkf4IBRSeEc, and Nathan Kutz, https://www.youtube.com/watch?v=a9jdQGybYmE.

  4. Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen: The paper is reprinted as a chapter in D. Hilbert and E. Schmidt, Integralgleichungen und Gleichungen mit unendlich vielen Unbekannten (Leipzig: BSB B. G. Teubner Verlagsgesellschaft, 1989), pp. 8–10.

  5. Wolfram Alpha: Go to www.wolframalpha.com/ and type “eigenvalues {{1, 1}, {0, -2}}” into the query box and hit Return. Wolfram Alpha will calculate the eigenvectors and eigenvalues.

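The same calculation can be reproduced offline with NumPy; a minimal sketch (NumPy makes no promise about the order of the eigenvalues, so they are sorted before printing):

```python
import numpy as np

# The matrix from the query "eigenvalues {{1, 1}, {0, -2}}".
A = np.array([[1.0, 1.0],
              [0.0, -2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# A is upper triangular, so its eigenvalues are simply its
# diagonal entries, 1 and -2; eig also returns one unit
# eigenvector per eigenvalue as the columns of `eigenvectors`.
print(np.sort(eigenvalues.real))  # prints [-2.  1.]
```
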
  6. In his Stanford lectures: Anand Avati, “Lecture 1 - Introduction and Linear Algebra,” Stanford CS229, Machine Learning, Summer 2019, YouTube video, n.d., https://youtu.be/KzH1ovd4Ots.

  7. Let’s say we have a dataset of vehicles: This example was described by John Abel in email correspondence on January 9, 2023.

  8. “You’ve got to know”: Kenny Rogers, “The Gambler,” lyrics by Don Schlitz, Songfacts, n.d., https://www.songfacts.com/lyrics/kenny-rogers/the-gambler.

  9. “The work of eugenicists”: Ronald A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics 7, No. 2 (September 1936): 179–83.

  10. “There for mile after mile”: David F. Andrews and A. M. Herzberg, Data: A Collection of Problems from Many Fields for the Student and Research Worker (New York: Springer-Verlag, 1985), p. 5.

  11. “It’s a very rich dataset”: Zoom interview with John Abel on February 24, 2022. This and all subsequent quotes by Abel are from this author interview.

  12. here’s what you get: Abel et al., “Constructing a Control-Ready Model of EEG Signal During General Anesthesia in Humans,” p. 15873.

CHAPTER 7: THE GREAT KERNEL ROPE TRICK

  1. a recent immigrant: “Vladimir Vapnik,” The Franklin Institute, n.d., https://www.fi.edu/en/laureates/vladimir-vapnik.

  2. appeared in an addendum: Vladimir Vapnik, Estimation of Dependencies Based on Empirical Data (New York: Springer-Verlag, 1982), p. 362.

  3. excellent exposition: Patrick Winston’s lecture on SVMs lucidly explains the mathematics. See “[Lecture] 16: Support Vector Machines,” MIT OpenCourseWare, Fall 2010, YouTube video, n.d., https://www.youtube.com/watch?v=_PwhiWxHK8o.

  4. “a kind of scientific poem”: C. Truesdell, Essays in the History of Mechanics (Berlin and Heidelberg: Springer-Verlag, 1968), p. 86.

  5. Krauth and Mézard: Werner Krauth and Marc Mézard, “Learning Algorithms with Optimal Stability in Neural Networks,” Journal of Physics A: Mathematical and General 20, No. 11 (1987): L745–52.

  6. “One of the examiners”: Zoom interview with Isabelle Guyon on November 12, 2021. This and all subsequent quotes by Guyon are from this author interview.

  7. by three Russian researchers: M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, “Theoretical Foundations of the Potential Function Method in Pattern Recognition,” Automation and Remote Control 25 (1964): 821–37.

  8. polynomial kernel: T. Poggio, “On Optimal Nonlinear Associative Recall,” Biological Cybernetics 19 (1975): 201–9.

  9. The size of the higher-dimensional space: John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis (Cambridge, UK: Cambridge University Press, 2004), p. 293.

  10. “This was a very simple change to the code”: Zoom interview with Bernhard Boser on July 16, 2021. This and all subsequent quotes by Boser are from this author interview.

  11. the radial basis function (RBF) kernel: D. S. Broomhead and D. Lowe, “Multivariable Functional Interpolation and Adaptive Networks,” Complex Systems 2 (1988): 321–55.

  12. For the curious, here’s the kernel function: Andrew Ng, “Exercise 8: Non-linear SVM Classification with Kernels,” for course Machine Learning, OpenClassroom, openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html.

  13. “Brad Pitt of kernels”: Kilian Weinberger, “Machine Learning Lecture 22: More on Kernels—Cornell CS4780 SP17,” YouTube video, n.d., https://youtu.be/FgTQG2IozlM, at 38:08.

  14. “of course by the KERNEL TRICK!!!”: Email interviews with Manfred Warmuth on March 6 and March 9, 2022. This and all subsequent quotes by Warmuth are from these author interviews.

  15. it appeared in July 1992: Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” COLT ’92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory (July 1992): 144–52.

  16. This approach, published in 1995: Corinna Cortes and Vladimir Vapnik, “Support-Vector Networks,” Machine Learning 20 (1995): 273–97.

  17. “Manfred [Warmuth] and I gave Vapnik a kind of superstar status”: Email interviews with David Haussler on March 7 and March 8, 2022. This and all subsequent quotes by Haussler are from these author interviews.

  18. “Vapnik-Chervonenkis (VC) dimension”: Anselm Blumer et al., “Learnability and the Vapnik-Chervonenkis Dimension,” Journal of the ACM 36, No. 4 (October 1989): 929–65.

  19. Frontiers of Knowledge Award: “The Frontiers of Knowledge Awards recognize Guyon, Schölkopf, and Vapnik for Teaching Machines How to Classify Data,” BBVA Foundation, February 2020, https://tinyurl.com/bddcdtv8.

  20. a comprehensive book on kernel methods: Bernhard Schölkopf and Alexander J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Cambridge, Mass.: The MIT Press, 2001).

CHAPTER 8: WITH A LITTLE HELP FROM PHYSICS

  1. “Now what?”: John Hopfield, “Now What?” Princeton Neuroscience Institute, October 2018, https://pni.princeton.edu/people/john-j-hopfield/now-what.

  2. “You can’t make things error-free”: Zoom interview with John Hopfield on October 25, 2021. This and all subsequent quotes by Hopfield are from this author interview, unless identified as quotes from his essay “Now What?”

  3. Hopfield published his “biology” paper in 1974: John Hopfield, “Kinetic Proofreading: A New Mechanism for Reducing Errors in Biosynthetic Processes Requiring High Specificity,” Proceedings of the National Academy of Sciences 71, No. 10 (October 1, 1974): 4135–39.

  4. “This was the first paper”: Hopfield, “Now What?”

  5. In 1976, Hopfield gave a talk at Harvard: Hopfield, “Now What?”

  6. “It was one of the biggest”: Hopfield, “Now What?”

  7. “A network could ‘solve a problem’ or”: Hopfield, “Now What?”

  8. “How mind emerges from brain”: Hopfield, “Now What?”

  9. “primate neuroanatomy”: Hopfield, “Now What?”

  10. it’s an amorphous solid: Ciara Curtin, “Fact or Fiction?: Glass Is a (Supercooled) Liquid,” Scientific American, February 22, 2007, https://www.scientificamerican.com/article/fact-fiction-glass-liquid/.

  11. Certain materials…are ferromagnetic: “Ferromagnetism,” LibreTexts, n.d., https://tinyurl.com/2p8jcxmf.

  12. For his doctoral thesis: S. G. Brush, “History of the Lenz-Ising Model,” Reviews of Modern Physics 39, No. 4 (1967): 883–93.

  13. Rudolf Ernst Peierls, a German physicist: Sabine Lee, “Rudolf Ernst Peierls, 5 June 1907–19 September 1995,” Biographical Memoirs of Fellows of the Royal Society 53 (December 1, 2007): 265–84.

  14. “For sufficiently low temperatures”: R. H. Dalitz and Sir Rudolf Peierls, eds., Selected Scientific Papers of Sir Rudolf Peierls (Singapore: World Scientific Publishing, 1997), p. 229.

  15. the Hamiltonian of a system: Giorgio Parisi, “Spin Glasses and Fragile Glasses: Statics, Dynamics, and Complexity,” Proceedings of the National Academy of Sciences 103, No. 21 (May 23, 2006): 7948–55.

  16. Here’s one: Ada Altieri and Marco Baity-Jesi, “An Introduction to the Theory of Spin Glasses,” arXiv, February 9, 2023, https://arxiv.org/abs/2302.04842. Also, see: Viktor Dotsenko, An Introduction to the Theory of Spin Glasses and Neural Networks (Singapore: World Scientific, 1994), pp. 4, 113.

  17. Hopfield quotes Hemingway: Hopfield, “Now What?”

  18. The proof that follows: Raúl Rojas, Neural Networks: A Systematic Introduction (Berlin: Springer, 2013), pp. 349–54.

  19. “Since there is only a finite set”: Rojas, Neural Networks, p. 353.

CHAPTER 9: THE MAN WHO SET BACK DEEP LEARNING (NOT REALLY)

  1. “I felt like a rock star”: Zoom interview with George Cybenko on November 11, 2021. This and all subsequent quotes by Cybenko are from this author interview.

  2. blog post Cybenko later read: Vincenzo Lomonaco, “What I Learned at the Deep Learning Summer School 2017 in Bilbao,” Medium, July 27, 2017, https://tinyurl.com/4xhc7h9e.

  3. For a detailed visual analysis of this approach: Chapter 4: “A Visual Proof that Neural Nets Can Compute Any Function,” in Michael Nielsen, Neural Networks and Deep Learning (Determination Press, 2015), http://neuralnetworksanddeeplearning.com/chap4.html.

  4. small technical report: G. Cybenko, “Continuous Valued Neural Networks with Two Hidden Layers Are Sufficient,” Technical Report, 1988, Department of Computer Science, Tufts University.

  5. In the concluding paragraph of his 1989 paper: G. Cybenko, “Approximation by Superpositions of a Sigmoidal Function,” Mathematics of Control, Signals, and Systems 2 (December 1989): 303–14.

CHAPTER 10: THE ALGORITHM THAT PUT PAID TO A PERSISTENT MYTH

  1. “Yes”: Zoom interview with Geoffrey Hinton on October 1, 2021. This and subsequent quotes by Hinton are from this author interview.

  2. a theoretical chemist: Chris Darwin, “Christopher Longuet-Higgins: Cognitive Scientist with a Flair for Chemistry,” The Guardian, June 10, 2004, https://www.theguardian.com/news/2004/jun/10/guardianobituaries.highereducation.

  3. “The procedure to be described”: Frank Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Cornell University Report No. 1196-G-8, March 15, 1961, p. 292.

  4. “It would seem that”: Rosenblatt, Principles of Neurodynamics, p. 287.

  5. “While this theorem shows”: Rosenblatt, Principles of Neurodynamics, p. 291.

  6. “Let the machine make”: M. Minsky and O. G. Selfridge, “Learning in Random Nets,” in Information Theory, ed. E. C. Cherry (London: Butterworth, 1961), pp. 335–47.

  7. “Minsky and Papert were so intent”: Hubert L. Dreyfus and Stuart E. Dreyfus, “Making a Mind Versus Modeling the Brain: Artificial Intelligence Back at a Branchpoint,” Daedalus 117, No. 1 (Winter 1988): 15–43.

  8. And in an exhaustive blog post: Jürgen Schmidhuber, “Who Invented Backpropagation?” AI Blog (blog), 2014, https://people.idsia.ch/~juergen/who-invented-backpropagation.html.

  9. Paul Werbos submitted his Ph.D. thesis: P. Werbos, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences” (Ph.D. diss., Harvard University, 1974).

  10. “In general, the procedure”: Werbos, “Beyond Regression.”

  11. “As a result of the weight adjustments”: David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, “Learning Representations by Back-propagating Errors,” Nature 323 (October 1986): 533–36.

  12. “there is a kid in France”: Zoom interview with Yann LeCun on October 11, 2021. This and all subsequent quotes by LeCun are from this author interview.

CHAPTER 11: THE EYES OF A MACHINE

  1. “By now the award”: H. B. Barlow, “David Hubel and Torsten Wiesel: Their Contribution Towards Understanding the Primary Visual Cortex,” Trends in Neurosciences 5 (1982): 145–52.

  2. “The electrode has been used”: David H. Hubel, “Tungsten Microelectrode for Recording from Single Units,” Science 125 (March 22, 1957): 549–50.

  3. It’s these electrodes that Hubel and Wiesel: D. H. Hubel and T. N. Wiesel, “Receptive Fields of Single Neurones in the Cat’s Striate Cortex,” Journal of Physiology 148 (1959): 574–91.

  4. “made it necessary to use artificial respiration”: Hubel and Wiesel, “Receptive Fields of Single Neurones in the Cat’s Striate Cortex.”

  5. “drying and becoming cloudy”: Hubel and Wiesel, “Receptive Fields of Single Neurones in the Cat’s Striate Cortex.”

  6. “At Harvard University, kittens were blinded”: Steven Zak, “Cruelty in Labs,” New York Times, May 16, 1983, https://www.nytimes.com/1983/05/16/opinion/cruelty-in-labs.html.

  7. “specializing in animals and the law”: Zak, “Cruelty in Labs.”

  8. “Among other things”: David S. Forman, “Grim Alternative to Animal Experiments,” New York Times, May 30, 1983, https://www.nytimes.com/1983/05/30/opinion/l-grim-alternative-to-animal-experiments-195873.html.

  9. the stroke of luck: David Hubel describes his experiment and the team’s serendipitous discovery in this video: Paul Lester, “Hubel and Wiesel Cat Experiment,” YouTube, n.d., https://www.youtube.com/watch?v=IOHayh06LJ4.

  10. this neat story is what led: Grace W. Lindsay, “Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future,” Journal of Cognitive Neuroscience 33, No. 10 (2021): 2017–31.

  11. “Cognitron”: Kunihiko Fukushima, “Cognitron: A Self-Organizing Multilayered Neural Network,” Biological Cybernetics 20 (September 1975): 121–36.

  12. “response was dependent”: Kunihiko Fukushima, “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biological Cybernetics 36 (April 1980): 193–202.

  13. In the 1980 paper: Fukushima, “Neocognitron.”

  14. “One of the largest”: Fukushima, “Neocognitron,” p. 201.

  15. he chanced upon the book: Massimo Piattelli-Palmarini, ed., Language and Learning: The Debate Between Jean Piaget and Noam Chomsky (Cambridge, Mass.: Harvard University Press, 1980).

  16. “I believe that Chomsky”: Piattelli-Palmarini, ed., Language and Learning, p. 91.

  17. “I will do this by describing an automaton”: Piattelli-Palmarini, ed., Language and Learning, p. 93.

  18. Called SN: For a history of Lush, and its ancestor SN, see https://leon.bottou.org/projects/lush.

  19. LeNet: Yann LeCun et al., “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE 86, No. 11 (November 1998): 2278–324.

  20. the convolution: Trefor Bazett, “The Convolution of Two Functions | Definition & Properties,” YouTube video, n.d., https://www.youtube.com/watch?v=AgKQQtEc9dk.

  21. These are called Prewitt kernels: Achmad Fahrurozi et al., “Wood Classification Based on Edge Detections and Texture Features Selection,” International Journal of Electrical and Computer Engineering 6, No. 5 (October 2016): 2167–75.

  22. The basic idea behind max pooling: “Max Pooling,” paperswithcode.com/method/max-pooling.

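Max pooling can be sketched in a few lines of NumPy: slide a window over a feature map and keep only the largest activation in each patch. A minimal 2×2, stride-2 sketch (the helper name is my own):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in
    each non-overlapping 2x2 patch, halving height and width."""
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "dimensions must be even"
    # View the map as a grid of 2x2 blocks, then reduce each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]])
print(max_pool_2x2(fmap))
# [[4 2]
#  [2 8]]
```
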
  23. “All we need to achieve”: D. C. Ciresan et al., “Deep Big Simple Neural Nets for Handwritten Digit Recognition,” Neural Computation 22, No. 12 (2010): 3207–20.

  24. finding roads in aerial images: Volodymyr Mnih and Geoffrey E. Hinton, “Learning to Detect Roads in High-Resolution Aerial Images,” PDF, https://www.cs.toronto.edu/~hinton/absps/road_detection.pdf.

  25. Mnih wrote another package atop CUDA: Volodymyr Mnih, “CUDAMat: A CUDA-Based Matrix Class for Python,” PDF, University of Toronto Technical Report, UTML TR 2009–004, http://www.cs.toronto.edu/~vmnih/docs/cudamat_tr.pdf.

  26. Fei-Fei Li and her students presented a paper: J. Deng et al., “ImageNet: A Large-Scale Hierarchical Image Database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, Fla., 2009, pp. 248–55.

  27. “taster competition”: Visual Object Classes Challenge 2010, host.robots.ox.ac.uk/pascal/VOC/voc2010/.

  28. “Everything went out of the window”: Zoom interviews with Mikhail Belkin on July 20, 2021; January 15, 2022; and January 13, 2023. This and subsequent quotes by Belkin are from these author interviews, unless specified otherwise.

CHAPTER 12: TERRA INCOGNITA

  1. “grokking”: See Britannica for definition and origins of “grok”: https://www.britannica.com/topic/grok.

  2. “Grokking is meant to be”: Zoom interview with Alethea Power on January 28, 2022. This and subsequent quotes by Power are from this author interview.

  3. “It’s a balance between”: Anil Ananthaswamy, “A New Link to an Old Model Could Crack the Mystery of Deep Learning,” Quanta, October 11, 2021, https://tinyurl.com/27hxb5k5.

  4. the bias-variance trade-off: Scott Fortmann-Roe, “Understanding the Bias-Variance Trade-off” (blog), June 2012, http://scott.fortmann-roe.com/docs/BiasVariance.html.

  5. “More surprising is that”: Behnam Neyshabur et al., “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning,” arXiv, April 16, 2015, https://arxiv.org/abs/1412.6614.

  12. “Even with five percent random labels”: Neyshabur et al., “In Search of the Real Inductive Bias.”

    GO TO NOTE REFERENCE IN TEXT

  14. “The experiments we conducted”: Chiyuan Zhang et al., “Understanding Deep Learning Requires Rethinking Generalization,” arXiv, February 26, 2017, https://arxiv.org/abs/1611.03530.

    GO TO NOTE REFERENCE IN TEXT

  16. “We routinely teach our undergraduates”: In-person interview with Peter Bartlett in Berkeley, California, on December 11, 2021.

    GO TO NOTE REFERENCE IN TEXT

  18. “The best way to solve the problem”: Ruslan Salakhutdinov quoted in Mikhail Belkin, “Fit without Fear: Remarkable Mathematical Phenomena of Deep Learning through the Prism of Interpolation,” arXiv, May 29, 2021, https://arxiv.org/abs/2105.14368.

    GO TO NOTE REFERENCE IN TEXT

  20. wrote a paper in 1995: Leo Breiman, “Reflections After Refereeing Papers for NIPS,” in David H. Wolpert, ed., The Mathematics of Generalization (Boca Raton, Fla.: CRC Press, 1995), pp. 11–15.

    GO TO NOTE REFERENCE IN TEXT

  22. “acronym encourages sexism and is a slur”: Holly Else, “AI Conference Widely Known as ‘NIPS’ Changes Its Controversial Acronym,” Nature News, November 19, 2018, https://www.nature.com/articles/d41586-018-07476-w.

    GO TO NOTE REFERENCE IN TEXT

  24. “Why don’t heavily parameterized”: Leo Breiman, “Reflections After Refereeing Papers for NIPS,” p. 15.

    GO TO NOTE REFERENCE IN TEXT

  26. an ML algorithm called AdaBoost: Peter Bartlett et al., “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods,” The Annals of Statistics 26, No. 5 (October 1998): 1651–86.

    GO TO NOTE REFERENCE IN TEXT

  28. long short-term memory: Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation 9, No. 8 (1997): 1735–80.

    GO TO NOTE REFERENCE IN TEXT

  30. For example, the ReLU function: Sebastian Raschka, “Machine Learning FAQ: Why Is the ReLu Function Not Differentiable at x=0?” Sebastian Raschka, AI Magazine (blog), n.d., https://sebastianraschka.com/faq/docs/relu-derivative.html.

    GO TO NOTE REFERENCE IN TEXT

  32. “If, by the first day of autumn”: Bet described by Alexei Efros during an interview via Zoom on January 28, 2022. This and subsequent quotes by Efros are from this author interview. Also, see “The Gelato Bet,” March 2019, https://people.eecs.berkeley.edu/~efros/gelato_bet.html.

    GO TO NOTE REFERENCE IN TEXT

  34. This changed in December 2021: Anil Ananthaswamy, “Self-Taught AI Shows Similarities to How the Brain Works,” Quanta, August 11, 2022, https://tinyurl.com/8z35n24j.

    GO TO NOTE REFERENCE IN TEXT

  36. One of the most intriguing places in that landscape: Mikhail Belkin et al., “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off,” Proceedings of the National Academy of Sciences 116, No. 32 (July 24, 2019): 15849–54.

    GO TO NOTE REFERENCE IN TEXT

  38. “Maybe we shouldn’t have such a focus on theory”: Tom Goldstein spoke at the National Science Foundation’s Town Hall on machine learning on January 10, 2022. This quote and other quotes by Goldstein are taken from his talk, available at https://tinyurl.com/4m5396b7, beginning at 29:40.

    GO TO NOTE REFERENCE IN TEXT

  40. An empirical study by Goldstein and colleagues: Micah Goldblum et al., “Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory,” arXiv, April 28, 2020, https://arxiv.org/abs/1910.00359.

    GO TO NOTE REFERENCE IN TEXT

  42. stochasticity wasn’t needed: Jonas Geiping et al., “Stochastic Training Is Not Necessary for Generalization,” arXiv, April 19, 2022, https://arxiv.org/abs/2109.14119.

    GO TO NOTE REFERENCE IN TEXT

  44. “scientific papers from the arXiv preprint server”: Ethan Dyer and Guy Gur-Ari, Google Research, Blueshift Team, “Minerva: Solving Quantitative Reasoning Problems with Language Models,” Google Research (blog), June 30, 2022, https://blog.research.google/2022/06/minerva-solving-quantitative-reasoning.html.

    GO TO NOTE REFERENCE IN TEXT

EPILOGUE

  2. At a public talk I gave on ChatGPT: Anil Ananthaswamy, “ChatGPT and Its Ilk,” YouTube video, n.d., https://www.youtube.com/watch?v=gL4cquObnbE.

    GO TO NOTE REFERENCE IN TEXT

  4. “stochastic parrots”: Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery, New York, N.Y., March 2021, pp. 610–23.

    GO TO NOTE REFERENCE IN TEXT

  6. “Google Photos, y’all f*** [sic] up. My friend’s not a gorilla”: Maggie Zhang, “Google Photos Tags Two African-Americans as Gorillas Through Facial Recognition Software,” Forbes, July 1, 2015, https://tinyurl.com/yr5y97zz.

    GO TO NOTE REFERENCE IN TEXT

  8. this workaround was still in place: Nico Grant and Kashmir Hill, “Google’s Photo App Still Can’t Find Gorillas. And Neither Can Apple’s,” New York Times, May 22, 2023, https://tinyurl.com/4xbj6pmh.

    GO TO NOTE REFERENCE IN TEXT

  10. “black defendants were far more”: Jeff Larson et al., “How We Analyzed the COMPAS Recidivism Algorithm,” ProPublica, May 23, 2016, https://tinyurl.com/3adtt92t.

    GO TO NOTE REFERENCE IN TEXT

  12. Amazon had to jettison: Jeffrey Dastin, “Insight—Amazon Scraps Secret AI Recruiting Tool that Showed Bias Against Women,” Reuters, October 11, 2018, https://tinyurl.com/mpfmserk.

    GO TO NOTE REFERENCE IN TEXT

  14. a paper in Science: Ziad Obermeyer et al., “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations,” Science 366, No. 6464 (October 25, 2019): 447–53.

    GO TO NOTE REFERENCE IN TEXT

  16. “Gender Shades”: Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” Proceedings of Machine Learning Research 81 (2018): 1–15.

    GO TO NOTE REFERENCE IN TEXT

  18. following interaction with OpenAI’s GPT-4: Adam Tauman Kalai, “How to Use Self-Play for Language Models to Improve at Solving Programming Puzzles,” Workshop on Large Language Models and Transformers, Simons Institute for the Theory of Computing, August 15, 2023, https://tinyurl.com/56sct6n8.

    GO TO NOTE REFERENCE IN TEXT

  20. “Individual humans form their beliefs”: Celeste Kidd and Abeba Birhane, “How AI Can Distort Human Beliefs,” Science 380, No. 6651 (June 22, 2023): 1222–23.

    GO TO NOTE REFERENCE IN TEXT

  22. Hinton, the final speaker at the bootleg session: Adapted from Anil Ananthaswamy, “Artificial Neural Nets Finally Yield Clues to How Brains Learn,” Quanta, February 28, 2020.

    GO TO NOTE REFERENCE IN TEXT

  24. “I remember very distinctly”: Adapted from Anil Ananthaswamy, “Deep Neural Networks Help to Explain Living Brains,” Quanta, October 28, 2020.

    GO TO NOTE REFERENCE IN TEXT

  26. Yamins discovered that the architecture: Adapted from Ananthaswamy, “Deep Neural Networks Help to Explain Living Brains.”

    GO TO NOTE REFERENCE IN TEXT

  28. “Not only did we get good predictions”: Adapted from Ananthaswamy, “Deep Neural Networks Help to Explain Living Brains.”

    GO TO NOTE REFERENCE IN TEXT

  30. Nancy Kanwisher, a neuroscientist: Adapted from Ananthaswamy, “Deep Neural Networks Help to Explain Living Brains.”

    GO TO NOTE REFERENCE IN TEXT

  32. Another intriguing result also came from DiCarlo’s lab: Adapted from Ananthaswamy, “Deep Neural Networks Help to Explain Living Brains.”

    GO TO NOTE REFERENCE IN TEXT

  34. consumed, on average, about 1,664 watts: Anil Ananthaswamy, “In AI, Is Bigger Better?” Nature 615 (March 9, 2023): 202–5.

    GO TO NOTE REFERENCE IN TEXT

Index

The page numbers in this index refer to the printed version of the book. Each link will take you to the beginning of the corresponding print page. You may need to scroll forward from that location to find the corresponding reference on your e-reader.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

A

B

C

D

E

F

G

H

I

J

  • Jackel, Larry, 360

  • Jay, John, 120

  • Juola, Patrick, 124

K

L

M

N

O

P

Q

R

S

T

  • theoretical probability of an event, 107

  • theory of mind, 415–20, 428

  • Thompson, Silvanus P., 70

  • threshold activation function, 287, 326–27, 337

  • Tinbergen, Nikolaas, 7

  • Tobermory, 18

  • Tomioka, Ryota, 392

  • toy example, 171

  • Toyota Technological Institute, 392

  • training data, 12

  • training risk, 391

  • transfer RNA (tRNA), 242

  • transformer, 408–9

  • translational invariance, 351–52

  • transpose of a matrix, 43

  • two-dimensional vectors, 37, 216, 411

U

V

W

X

Y

Z

  • Zak, Steven, 347

  • Zhang, Chiyuan, 394

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

About the Author

Anil Ananthaswamy is an award-winning science writer and former staff writer and deputy news editor for New Scientist. He is the author of several popular science books, including The Man Who Wasn’t There, which was longlisted for the PEN/E.O. Wilson Literary Science Writing Award. He was a 2019–20 MIT Knight Science Journalism Fellow and the recipient of the Distinguished Alum Award, the highest award given by IIT Madras to its graduates, for his contributions to science writing.
